- Prediction of Protein
Coding Regions in DNA sequences Using Fourier Spectral Characteristics
with Professor Amir Asif

accepted for publication in ICASSP 2005 (pdf file). Also invited at Asilomar 2004. An earlier version (with Haoyuan Wang) was published in the bioinformatics track at MSE-2004. - New Clustering Algorithms for Microarray Data (work in progress)

ABSTRACT :
Microarrays are a recent technology for studying in a single experiment
how many genes express in a wide variety of conditions. Microarrays can be
used to infer the functions of a gene and to predict the roles of a
gene in causing a disease. A large amount of gene expression data has become
available from microarray experiments in recent years. However, finding patterns
in the data produced in microarray experiments is a challenging task, both
due to its large volume and due to the lack of prior knowledge of the patterns
being searched for. I have started working on an important subproblem --
cluster analysis of microarray data. Our approach differs from the large
body of existing work in that we use signal processing techniques to detect
important changes in the microarray data and use the locations of these changes
to do the clustering. A similar idea was investigated in [Syeda]; our approach
differs from this paper in that we use a more general notion of changes and
our clustering algorithm is hierarchical.

ABSTRACT :

The exponential growth in the sequencing of genomes has fueled interest in fast, automated decoding of inherent genomic information from these sequences. The identification of protein coding regions in DNA sequences is a fundamental step in computational recognition of genes. Traditional Discrete Fourier transform (DFT) based approaches [Anastassiou01b,TiwariRBBR97] exploit the empirical observation that the spectrum of protein coding DNA of length $N$ nucleotides typically has a peak at frequency k = N/3

corresponding to a periodicity of 3, which is the length of a DNA codon. We prove the aforementioned and several other empirical observations attributed to DNA sequences. In addition, we propose an algorithm that predicts genes based on short-term Fourier spectra. The recall and accuracy of the algorithm are measured by running the algorithm over a large dataset. Our algorithm improves on existing algorithms in several ways. We show that it is computationally more efficient than existing algorithms based on short-term Fourier Transforms [Anastassiou01b,TiwariRBBR97]. Being deterministic, our algorithm produces an unique answer as opposed to the randomized algorithms in the Genomic Signal Processing literature. We motivate and demonstrate the advantages of the use of windows other than the rectangular ones used so far in the literature.

REFERENCES:

[Anastassiou01b] D.~Anastassiou. Genomic signal processing. IEEE Signal
Processing Magazine, July 2001.

[TiwariRBBR97] S. Tiwari and S. Ramachandran and A. Bhattacharya and S.
Bhattacharya and R. Ramaswamy.

Prediction of probable genes by Fourier analysis of genomic sequences. Computer
Applications in Biosciences,

volume 13, pages 263-270, 1997.

[Syeda] Tanveer Syeda-Mahmood. Clustering Time-varying Gene Expresssion Profiles Using Scale-space Signals. Proceedings of the Computational Systems Bioinformatics (CSB), 2003.

Note: I am often asked about how much Biology and Chemistry students need to know to do a project in this area. For 4080 students, I will (most likely) learn the Biological aspects and allow the student to focus on only the computational aspects (of course a student is welcome to learn the Biological aspects). M.Sc. students working in this area do need to know a little bit more, but that can be picked up by self-study. There are no graduate courses suitable for such students but we are working towards designing one.

- There is a very brief introduction here and a more detailed introduction here.
- There are several articles on Genomics on this page.
- A course page with lecture notes and links to some nice papers.
- Here is a list of On-line bioinformatics courses and tutorials.
- For an introduction to Genomic Signal Processing, look at this page. You should be able to download the article from any computer at York.
- There are some good lecture notes at this page.
- For microarrays, visit this page for a nice introduction, and do not worry if you do not understand all the Biology. The problem I am interested in is to design algorithms for analyzing the data. This page has a brief description of some the data mining/analysis problems involved.