Project Description

Proteogenomics measurements reveal the genomic sequences and enzyme inventories of microbial communities. This information provides valuable insights into biological processes in microbial ecosystems and their responses to different environmental perturbations. Increased throughput in sequencing and mass spectrometry technologies has enabled the generation of big proteogenomics data that achieve more comprehensive coverage of complex microbial communities across a large number of environmental conditions. The raw sequencing and mass spectrometry data from a proteogenomics study need to be processed through a suite of computational analyses to yield meaningful results for biologists. These analyses are computationally intensive for big proteogenomics data. To overcome this challenge, we have developed scalable algorithms to run key data analytics on the Titan supercomputer. The computational readiness and scientific impact of our large-scale computations on Titan have been demonstrated in two previous ALCC awards and in the scientific publications produced using the awarded Titan allocations. Here we propose to use our scalable proteogenomics data analytics capabilities and the Titan and Summit supercomputers for studies of plant rhizosphere communities, tropical soil communities, and human gut microbiota. In the study of plant rhizosphere communities, we aim to use proteomic stable isotope probing to identify the microorganisms that take up carbon from plant root exudates and the active metabolic pathways in these microorganisms. In the study of tropical soil communities, we will characterize the responses of soil communities in a tropical rainforest to different levels of nitrogen and phosphorus availability. In the study of human gut microbiota, we will compare the microbiota of lean African Americans with those of obese African Americans to identify microbial functions potentially contributing to obesity.
In all these studies, we will work with collaborators to perform deep proteogenomics analyses of microbial communities and analyze the acquired large datasets using scalable algorithms on Titan and Summit. Specifically, the Sipros algorithm will be used to identify proteins expressed by microbial communities from mass spectrometry data. The Disco algorithm will be used to assemble genomes of microorganisms from Illumina shotgun sequencing of community metagenomes. The Sigma algorithm will be used to quantify the abundances of genomic sequences in metagenomes and of transcripts in metatranscriptomes. The computational scalability of these algorithms has been benchmarked on Titan. The requested allocations on Titan will be essential for carrying out the big data analytics in these large-scale proteogenomics studies. The results will provide critical knowledge of the microbial communities that impact plant productivity, soil nutrient cycling, and human health. Petascale analytics of big proteogenomics data on key microbial communities.

Allocation History

Source Hours Start Date End Date