Project Description

We address the challenge [1] posed by the DoD's DTRA agency: identifying organisms from a stream of DNA sequences.

A key application is the surveillance of bacterial populations in clinical settings to address the global threat of antibiotic resistance [2].

Our framework would allow accurate, near-real-time (minutes) analysis of DNA sequence samples, whereas current tools take days to run. It would also speed up bacteriology, virology, immunology, and environmental metagenomic studies; the study in [3], for example, took more than 500,000 hours to execute using existing tools.

Using a novel algorithm (a log-linear tagger) from the NLP domain allows us to capture richer structural patterns that existing HMM-based tools miss. Supercomputing resources would also allow us to cover a large number of features and run multiple tests to fine-tune the model parameters.
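To illustrate why a log-linear tagger can capture patterns that an HMM cannot, here is a minimal sketch. It is not the project's implementation. The tag set ("noncoding", "coding"), the feature templates, and the hand-set weights are all hypothetical, chosen only for illustration; a real model would learn its weights from data and would likely use Viterbi or beam search rather than the greedy decoding shown here. The point is that each tagging decision can draw on many overlapping features of the surrounding sequence, whereas an HMM emission depends only on the current state.

```python
import math

# Hypothetical label set, for illustration only.
TAGS = ("noncoding", "coding")


def features(seq, i, prev_tag):
    """Overlapping features for position i: the base itself, the previous
    tag, and the k-mers (k = 2, 3) starting at and ending before i."""
    feats = ["bias", f"base={seq[i]}", f"prev_tag={prev_tag}"]
    for k in (2, 3):
        if i + k <= len(seq):
            feats.append(f"next{k}={seq[i:i + k]}")
        if i - k >= 0:
            feats.append(f"prev{k}={seq[i - k:i]}")
    return feats


def tag_probs(weights, feats):
    """Log-linear model: P(tag | feats) is a softmax over summed weights."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in TAGS}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def greedy_tag(weights, seq):
    """Tag left to right, feeding each predicted tag into the next step."""
    tags, prev = [], "START"
    for i in range(len(seq)):
        probs = tag_probs(weights, features(seq, i, prev))
        prev = max(TAGS, key=lambda t: probs[t])
        tags.append(prev)
    return tags


# Hand-set weights (an assumption; a real model would learn these).
example_weights = {
    ("bias", "noncoding"): 0.5,
    ("next3=ATG", "coding"): 2.0,
    ("prev_tag=coding", "coding"): 1.0,
}
example_tags = greedy_tag(example_weights, "CCATGAA")
```

Because features are plain strings keyed to tags, adding a new pattern (a longer k-mer, a motif match, a positional signal) only means adding a template to `features`. This is also why the feature space grows quickly and why large-scale computing resources help with training and tuning.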

Some of the artifacts produced in this project, such as the machine learning framework for sequence tagging, will be available for general use and can serve future projects both in bioinformatics and in other domains such as time-series analysis, natural language processing, or image processing.

The framework we develop will serve as the back end of a search engine accessible via web and REST API interfaces. We intend to make DNA searches and analyses as easy as Google searches.

Allocation History

Source                                   Hours       Start Date   End Date
OLCF DIRECTOR'S DISCRETIONARY PROGRAM    2,000,000   2014-05-22   2014-09-30