Motif discovery in sets of biological sequences is an
important and active research area in computational biology. In the analysis of
sequence data, the motif search problem encompasses several important problems,
where biologically important patterns are known as motifs. For example,
analyzing large-scale genomic and proteomic data in order to discover motifs is
one such challenge. A motif is a conserved amino acid sequence pattern which
is present in most of the proteins of a protein family and is thought to be
biologically significant for those proteins in exhibiting their structure or
function. These conserved regions often either provide structural support to
the protein or serve as functionally important parts of it. Hence, to
better understand the tertiary structure and to predict the function of a
protein, it is essential to discover such motifs. Various algorithms exist for
motif discovery, such as AlignACE [1] and Weeder [2] (which are used to
discover DNA motifs), and Gibbs [3] and MEME [4] (which are used to discover
motifs in both protein and DNA datasets).

Structure-based motif discovery is a more complex process. In this approach, an
X-ray structural study of proteins with similar function is carried out, which
is a good indicator of the binding site; the amino acid residues forming the
binding site are then considered to be the motif responsible for the function. A
list of such known patterns has been compiled in the PROSITE [5] database.
PROSITE also has a program which matches these patterns against sequences, so
the primary sequence can be used directly to find the pattern. If a new sequence
contains a known pattern, this is a good indicator of its possible function. The
patterns in PROSITE are not derived automatically but by inspection. However,
given the rate at which new sequences are being determined, there is a need for
an automatic method to extract patterns from primary sequence information.
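
Matching a known pattern against a primary sequence can be illustrated with a
minimal sketch. It assumes the commonly documented PROSITE pattern syntax
(elements separated by "-", "x" for any residue, "[...]" for allowed residues,
"{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for
the sequence ends) and translates it into a regular expression. The function
names prosite_to_regex and find_pattern are hypothetical, the test sequence is
made up, and this is not the program distributed with PROSITE.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # "<" anchors the pattern at the N-terminus, ">" at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the element into its core and an optional repeat, e.g. x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, repeat = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # excluded residues
        # "[...]" and single residue letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (start, matched_text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Illustrative pattern in PROSITE syntax and a made-up sequence.
hits = find_pattern("N-{P}-[ST]-{P}", "MKNGSAANPTQNDTL")
print(hits)
```

The lookahead group is a design choice that lets overlapping occurrences be
reported; a plain search would skip any match starting inside a previous one.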

The traditional approach to motif discovery is based on multiple sequence
alignment. In this approach, to construct the consensus pattern, a region with
greater than average similarity is identified in the aligned sequences.
However, multiple sequence alignment works best for limited sets of related
proteins, because the alignments are sensitive to gap penalty parameters and
similarity measures. An alternative approach to the problem is to use
statistical techniques to discover biologically meaningful patterns and
relationships.
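
The alignment-based idea of picking out a region with greater than average
similarity can be sketched as follows. The sketch assumes the sequences are
already aligned to equal length with "-" as the gap character, scores each
column by the fraction of sequences sharing its most common residue, and
reports runs of columns scoring above the mean. The scoring rule, the
min_length parameter and the function names are illustrative assumptions, not
the method of any specific tool, and the alignment is made up.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (consensus residue, conservation score) for each column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_sequences = len(alignment)
    columns = []
    for column in zip(*alignment):
        # Gaps never count towards the most common residue.
        counts = Counter(residue for residue in column if residue != "-")
        if counts:
            residue, count = counts.most_common(1)[0]
        else:
            residue, count = "-", 0
        columns.append((residue, count / n_sequences))
    return columns


def conserved_regions(alignment, min_length=3):
    """Return (start column, consensus) for runs scoring above the mean."""
    columns = column_conservation(alignment)
    mean_score = sum(score for _, score in columns) / len(columns)
    regions = []
    start = None
    # A sentinel column with a negative score closes a run at the very end.
    for i, (_, score) in enumerate(columns + [("", -1.0)]):
        if score > mean_score:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                consensus = "".join(r for r, _ in columns[start:i])
                regions.append((start, consensus))
            start = None
    return regions


# Made-up alignment of four short sequences.
alignment = ["ACDEFGHK", "TCDEFGRL", "SCDEFGQM", "A-DEFGWN"]
regions = conserved_regions(alignment)
print(regions)
```

Because the result depends on an alignment computed beforehand, it inherits
that alignment's sensitivity to gap penalties and similarity settings.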