Pattern Discovery and Computational Biology

Given a string, a pattern is a repeating substring possibly with some “don’t care” characters. The problem of pattern discovery is that of finding all such substrings that appear at least \(k\) times. Discovering such patterns in nucleic and amino acid sequences is of great interest in computational biology. If we do not have a model for evaluating the significance of the patterns a priori, it is important that all the patterns be detected. However, the number of such patterns could be exponential in the size of the input. In this talk, I will describe a notion of redundancy which gives rise to a provably small number of irredundant patterns, while preserving the information content of the set of all patterns. I will further discuss how this helps in designing a very efficient pattern discovery algorithm.

There are several generalizations of the pattern discovery problem which are important in computational biology and we will discuss one such generalization. We will conclude the talk by presenting several concrete examples, of the application of pattern discovery in the area of biology.