Sequence analysis for DNA storage and genomics

 Emerging Applications of Algebraic and Combinatorial Methods
  • DNA Storage and Phylogenomics.  

 

Overview: Data reconstruction from traces is an important problem with numerous applications, including data recovery, genomics and other areas of biology, chemistry, sensor networks, associative memories, and many others. The general problem of reconstruction is broadly divided into probabilistic and adversarial variants. In the probabilistic version, the traces are formed by passing the data through a noisy channel (typically an edit channel with certain deletion and insertion probabilities) and the goal is to reconstruct the data to within a certain error probability.

Specifically, consider the problem of passing a sequence through a noisy channel multiple times, and trying to reconstruct the sequence from the set of distinct channel outputs. It is then of practical interest to know how many distinct channel outputs are required to reconstruct the sequence in the worst case, and furthermore, what is the most efficient algorithm to reconstruct.

For example, one goal in Phylogenomics is to recover genetic information of a deceased individual from the genetic information of its descendants. Here, the genetic information of each descendant can be viewed as a transmission of the ancestor’s genetic information through a noisy channel. The same questions can be asked: How many descendants are required to recover the ancestor’s information? How can we efficiently reconstruct the ancestor’s information?

Recent Results: We recently found a closed form for the number of distinct insertion-channel outputs required to reconstruct, when the possible original sequences are separated by some minimum edit distance. This is useful for insertion correcting-codes where all codewords are separated by some edit distance.   It is also relevant to Phylogenomics because potential ancestors likely have genetic information that is separated by some non-trivial edit distance. We have several other novel results in submission and in preparation. Current research also involves specializing our framework to DNA storage.

Recent Publications

  • F. Sala, R. Gabrys, C. Schoeny, L. Dolecek, “Exact Reconstruction from Insertions in Synchronization Codes,” Submitted to IEEE Transactions on Information Theory, 2016
  • K. Mazooji, F. Sala, C. Schoeny, L. Dolecek, “On Reconstruction from Variable-Length Traces with Applications to Phylogenomics,”  Submitted 2016.
  • F. Sala, R. Gabrys, C. Schoeny, K. Mazooji, L. Dolecek, “Exact Sequence Reconstruction for Insertion-Correcting Codes,” International Symposium on Information Theory (ISIT), 2016.
  • F. Sala, C, Schoeny, and L. Dolecek, “Three Novel Combinatorial Theorems for the Insertion/Deletion Channel,” in Proc. IEEE International Symposium on Information Theory (ISIT), Hong Kong, June, 2015