Motif discovery in a set of biological sequences is an important and active research area in computational biology. In the analysis of sequence data, the motif search problem encompasses several important problems, where biologically important patterns are known as motifs. For example, discovering motifs in large-scale genomic and proteomic data is one of the current challenges. A motif is a conserved amino acid sequence pattern that is present in most of the proteins of a protein family and is thought to be biologically significant for the structure or function of those proteins. These conserved regions often either provide structural support to the protein or serve as functionally important parts of it. Hence, to better understand the tertiary structure of a protein and to predict its function, it is essential to discover such motifs.
Various algorithms exist for motif discovery, such as AlignACE [1] and Weeder [2] (used to discover DNA motifs), and Gibbs [3] and MEME [4] (used to discover motifs in both protein and DNA datasets).
Initially, motif discovery was a more complex process. In that approach, X-ray structural studies are carried out on proteins with similar function. The structure is a good indicator of the binding site, and hence the amino acid residues forming the binding site are considered to be the motif responsible for the function. A list of such known patterns has been compiled in the PROSITE [5] database. PROSITE also provides a program that matches these patterns against sequences, so the primary sequence can be used directly to detect a pattern. If a new sequence contains a known pattern, that is a good indicator of its possible function. The patterns in PROSITE are not derived automatically but by inspection.
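To make the idea of matching a known pattern against a primary sequence concrete, the following is a minimal sketch, not PROSITE's own program. It assumes the common PROSITE pattern notation (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends) and translates it into a Python regular expression. The function names `prosite_to_regex` and `find_motif` and the example sequence are hypothetical, introduced only for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Assumption: only the basic notation is handled (x, [..], {..},
    repetition counts, and the < / > terminus anchors).
    """
    pattern = pattern.strip().rstrip(".")  # patterns conventionally end with '.'
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as (2) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        # A single letter or an [ABC] class is already valid regex syntax.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a made-up example.
    print(find_motif("N-{P}-[ST]-{P}.", "MKNGSAAANPSA"))
```

Note that `re.finditer` reports non-overlapping matches only, so a scanner intended to report every occurrence would need a lookahead-based search instead.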
However, given the rate at which new sequences are being determined, there is a need for an automatic method to extract patterns from primary sequence information.
The traditional approach to motif discovery is based on multiple sequence alignment. In this approach, a consensus pattern is constructed from regions of the aligned sequences whose similarity is greater than average. However, multiple sequence alignment works best for limited sets of related proteins, because alignments are sensitive to the gap penalty parameters and the similarity scoring matrix.
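The alignment-based idea can be sketched as follows, under several assumptions: the sequences are already aligned (the alignment step itself is not shown), "similarity" is approximated by the frequency of the most common residue in each column, and a region counts as conserved when at least `min_length` consecutive columns score above the mean over all columns. The function names and the toy alignment are hypothetical, and this is one simple scoring choice rather than the method of any particular tool.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common residue, its frequency) for every alignment column."""
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        # Assumption: a gap-dominated column is treated as not conserved.
        result.append((residue, 0.0 if residue == "-" else count / n))
    return result


def conserved_regions(alignment, min_length=3):
    """Find runs of columns whose conservation exceeds the column average.

    Returns a list of (start_column, consensus_string) tuples.
    """
    cols = column_conservation(alignment)
    mean = sum(score for _, score in cols) / len(cols)
    regions, start = [], None
    # A sentinel column with a score below any mean closes a trailing run.
    for i, (_, score) in enumerate(cols + [("", -1.0)]):
        if score > mean:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                consensus = "".join(residue for residue, _ in cols[start:i])
                regions.append((start, consensus))
            start = None
    return regions


if __name__ == "__main__":
    # Made-up, pre-aligned toy sequences of equal length.
    toy_alignment = [
        "MKTCGHCAL",
        "ARSCGHCPV",
        "LQECGHCTI",
        "VDNCGHCSK",
    ]
    print(conserved_regions(toy_alignment))
```

Because the result depends entirely on the alignment passed in, a different gap penalty or scoring matrix in the alignment step can change which columns line up and therefore which regions are reported, which is the sensitivity noted above.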
Another approach to the problem is to use statistical techniques to discover biologically meaningful patterns and relationships.