Predicting Protein Targets and the Molecules that Bind with Them

When the COVID-19 pandemic brought the world to a standstill in 2020, pharmaceutical companies responded with remarkable speed, developing mRNA vaccines by the end of that year. Creating new antiviral drugs to treat those infected by the deadly coronavirus took about another year of work, which is astoundingly quick compared with the industry’s average timeframe of 14 years. That rapid response may not be the norm in future outbreaks. But what if the drug-development process could be greatly accelerated with a new AI methodology?

A team of computational scientists at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) has designed a machine-learning software stack that predicts both how strongly a given drug molecule will bind to a pathogen’s protein target and the 3D structure the molecule will adopt when attached to that target. These vital pieces of information can greatly shorten the usual trial-and-error process of lab experimentation to find viable drug candidates, especially for novel viruses such as SARS-CoV-2, the virus that causes COVID-19, whose protein structures are initially unknown.

“Together, these two representations—binding strength and structure—give pharmaceutical researchers more information to decide whether a molecule is worth pursuing as a drug candidate. This will reduce the amount of time, effort, and money spent on the experimental side,” said Jens Glaser, a computational scientist in materials at the National Center for Computational Sciences (NCCS) and the project’s team leader.

The software, TwoFold, has been named a finalist for the 2022 Gordon Bell Special Prize for HPC-Based COVID-19 Research by the Association for Computing Machinery. The awards will be presented at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22) on November 15–17, 2022, in Dallas, Texas.

The TwoFold team members with Frontier, from left: Feiyi Wang, Darren Hsu, Aditya Kashi, John Gounley, Michael Matheson, Jens Glaser, and Hao Lu. Photo: Genevieve Martin/ORNL.

TwoFold was inspired by the ORNL team’s previous efforts during the pandemic’s onset to identify candidates for new therapeutics in the fight against COVID-19. Conducting virtual screenings on ORNL’s Summit supercomputer, they used AI to help identify molecular candidates that could more efficiently bind with the virus’s proteins and potentially disrupt them. But they found that this methodology alone wasn’t informative enough to truly narrow the field of drug options to its most promising candidates.

“It’s not sufficient to only predict the single number that characterizes how strongly a molecule will bind against its viral target, because 90% of drugs fail in clinical trials. Even with in vitro experiments, it’s realistic to expect that only 1% of compounds are active against the target,” Glaser said. “So, we asked ourselves, what is holding us back from routinely producing good predictions? And we found that the big factor is a lack of efficient, reliable, and accurate structure prediction for novel, unseen targets.”

Without a map of a virus’s structure and its possible protein targets (a very likely shortcoming in the case of a novel coronavirus disease like COVID-19), researchers face a slow, hit-or-miss effort to identify drug candidates. Protein structures are usually mapped through crystallography, which starts with crystallization: a protein is dissolved in an aqueous solution until it reaches a supersaturated state and its molecules align themselves in a crystalline lattice. The crystal’s X-ray diffraction pattern can then be analyzed to determine the protein’s 3D structure. It’s a laborious process that may take anywhere from a month to a year or more. That means any methodology that requires protein structures to effectively identify drug candidates will be on hold until the crystal has been grown and analyzed. Not so with TwoFold.

“In our case, once we have the protein’s amino acid sequence—which can be easily obtained way before the crystallization that allows you to determine the target structures—you can already start the drug development process,” said team member Darren Hsu, a postdoctoral research associate at the NCCS. “By simply inputting the target sequence, TwoFold gives you a head start—it shortens your time to solution for proposing possible candidates.”

TwoFold provides two predictions at the same time, protein folding and molecule (or ligand) folding, hence its name. Folding is the process in which protein chains take on their conformation, or 3D arrangement of atoms, to assume their biological functions. Predicting both folds together allows TwoFold to estimate the binding strength of a given drug molecule and provide spatial and visual representations of how that molecule engages its target. Thus, scientists needn’t wait for the virus’s protein structure to be revealed in the lab before searching out candidates.

The ORNL team achieved this predictive twofer by devising a new methodology that computes its results in two steps conducted on two supercomputers: Frontier and Summit, both managed by the Oak Ridge Leadership Computing Facility, a DOE Office of Science user facility at ORNL.

First, the team trained a neural network (an AI component that learns to recognize patterns in datasets) on Summit with experimental data for drug-target interactions and receptor-bound structures. The workflow included a neural tangent kernel (NTK), which was used to provide further insights into the deep learning model.

“The NTK is useful to describe a neural net with infinitely many neurons. Real neural nets have a finite set of neurons. Understanding the NTK allows us to better understand the properties of finite networks by allowing us to focus our attention on the impact of having a finite number of paths,” Glaser said. “It can also be used to make predictions that can be compared with standard deep learning models, which is what we did for protein–ligand binding with TwoFold.”
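To make the idea concrete, here is a minimal sketch of NTK-based prediction on a toy problem. It is not TwoFold’s code: the tiny one-hidden-layer network, the function names (`init_params`, `ntk`, `fit`, `predict`), the sine-curve data, and the small ridge term are all assumptions chosen for illustration. The kernel entry for two inputs is the inner product of the network’s parameter gradients at those inputs, and predictions come from solving a linear system built from that kernel.

```python
import math
import random

def init_params(width, seed=0):
    """Random (w, b, a) triples for a toy one-hidden-layer network (hypothetical model)."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(width)]

def param_gradient(params, x):
    """Gradient of f(x) = sum_j a_j * tanh(w_j * x + b_j) / sqrt(width) w.r.t. every parameter."""
    scale = 1.0 / math.sqrt(len(params))
    grad = []
    for w, b, a in params:
        h = math.tanh(w * x + b)
        dh = 1.0 - h * h                    # derivative of tanh
        grad.extend([a * dh * x * scale,    # df/dw_j
                     a * dh * scale,        # df/db_j
                     h * scale])            # df/da_j
    return grad

def ntk(params, x1, x2):
    """Empirical NTK: inner product of the parameter gradients at two inputs."""
    g1, g2 = param_gradient(params, x1), param_gradient(params, x2)
    return sum(u * v for u, v in zip(g1, g2))

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit(params, xs, ys, ridge=1e-8):
    """Solve (K + ridge * I) alpha = y for the kernel weights alpha."""
    n = len(xs)
    kernel = [[ntk(params, xs[i], xs[j]) + (ridge if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    return solve(kernel, ys)

def predict(params, xs, alpha, x):
    """Kernel prediction: weighted sum of NTK values between x and the training inputs."""
    return sum(a * ntk(params, x, xi) for a, xi in zip(alpha, xs))

# Toy usage: fit five points on a sine curve, then predict at an unseen input.
params = init_params(width=64)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [math.sin(x) for x in xs]
alpha = fit(params, xs, ys)
print(predict(params, xs, alpha, 0.5))
```

In this sketch the kernel matrix has one row per training example, so the linear system grows with the size of the dataset. Five rows are trivial to solve on a laptop; a matrix with more than a million rows is a different matter.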

Second, the deep learning model produced a matrix of more than a million linear equations that had to be solved on the even more powerful, exascale-class Frontier supercomputer. The team found a solver in an unlikely place: the supercomputer benchmarking software HPL-AI.

“Because the matrix is so large, we needed the OLCF supercomputers to solve that efficiently. And it turns out that the same problem is used for measuring the performance of these supercomputers with HPL-AI—plus, the benchmark can exploit the supercomputer to its fullest because it’s meant to measure the performance,” said team member Aditya Kashi, an NCCS computer scientist in scalable numerical methods. “But this benchmark code solves a very specific version of the problem, so we worked on generalizing the benchmark to be able to solve a real problem, the solution for which we could actually use while still getting the advantages of the benchmark’s efficiency.”
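The HPL-AI benchmark is generally described as solving a dense linear system by factoring the matrix in low precision and then refining the answer until it reaches double-precision accuracy. The sketch below illustrates that idea in plain Python and is not the team’s generalized solver. Several things are assumptions made for illustration: float32 is simulated by rounding through the `struct` module, the test matrix is small and diagonally dominant so no pivoting is needed, and classical iterative refinement stands in for the GMRES-based refinement the benchmark is usually described as using. All function names are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(a):
    """LU factorization without pivoting, rounding every result to float32.
    Assumes a diagonally dominant matrix so that skipping pivoting is safe."""
    n = len(a)
    lu = [[to_f32(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = to_f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_f32(lu[i][j] - to_f32(lu[i][k] * lu[k][j]))
    return lu

def lu_solve(lu, b):
    """Forward and back substitution with packed LU factors (L has a unit diagonal)."""
    n = len(b)
    y = b[:]
    for i in range(n):
        y[i] -= sum(lu[i][j] * y[j] for j in range(i))
    x = y[:]
    for i in range(n - 1, -1, -1):
        x[i] = (x[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x

def residual(a, x, b):
    """Compute b - A x in full float64 precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]

def solve_mixed_precision(a, b, max_iters=20, tol=1e-12):
    lu = lu_factor_f32(a)       # expensive O(n^3) step, done in low precision
    x = lu_solve(lu, b)         # first, low-accuracy solution
    for _ in range(max_iters):
        r = residual(a, x, b)   # residual measured in float64
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve(lu, r)     # cheap O(n^2) correction reusing the factors
        x = [xi + di for xi, di in zip(x, d)]
    return x

# Toy usage: a 6x6 diagonally dominant system with a known solution.
n = 6
a = [[10.0 if i == j else 1.0 / (1 + i + j) for j in range(n)] for i in range(n)]
x_true = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]
x = solve_mixed_precision(a, b)
print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

The design point is that the costly factorization runs in the fast, low-precision arithmetic that modern accelerators favor, while a handful of inexpensive correction steps recover the accuracy a scientific result needs.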

Although using two different supercomputers presented some data input/output (I/O) challenges, the team believes that once they build automation features into TwoFold, this two-step computation process could become a routine workflow. Furthermore, with its unprecedented speed and accuracy, TwoFold could help make supercomputers an integral tool in drug discovery, thereby greatly accelerating the process.

“The solution of this roughly 1.1 million-row matrix was blazingly fast on Frontier in double-precision—an unprecedented time for a machine learning problem. It typically takes weeks to train an AI,” Glaser said. “But solving the neural tangent kernel on Frontier with 128 nodes took 78 seconds. That hints at the dramatic speedups that more parallel models could achieve on high-performance computing systems.”
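A back-of-the-envelope calculation gives a feel for the scale of those numbers. The figures of 1.1 million rows, 128 nodes, and 78 seconds come from the quote above; everything else is an assumption for illustration: that the matrix is square and dense, that each entry is stored as an 8-byte double, and that the work is comparable to a standard dense LU factorization (about 2/3 n^3 floating-point operations). The actual solver may store and process the matrix differently.

```python
# Figures quoted in the article
n = 1.1e6          # matrix rows (assumed square and dense)
nodes = 128        # Frontier nodes used
seconds = 78.0     # reported solve time

# Illustrative assumptions, not reported by the team
bytes_per_value = 8                      # double precision
lu_flops = (2.0 / 3.0) * n ** 3          # textbook cost of dense LU

matrix_bytes = n * n * bytes_per_value
per_node_gb = matrix_bytes / nodes / 1e9
implied_rate = lu_flops / seconds

print(f"matrix size:      {matrix_bytes / 1e12:.2f} TB")     # about 9.68 TB
print(f"per node:         {per_node_gb:.1f} GB")             # about 75.6 GB
print(f"implied LU rate:  {implied_rate / 1e15:.1f} PFLOP/s") # about 11.4 PFLOP/s
```

Under these assumptions the matrix alone occupies several terabytes, far more than a single workstation can hold, which is consistent with the team’s need to spread the problem across many supercomputer nodes.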

The team’s project was also motivated by advances in language models achieved under CANDLE, one of the Exascale Computing Project’s applications. MOSSAIC, a CANDLE subproject, targets state-of-the-art language model solutions to enable near-real-time cancer surveillance.

“It is impressive to observe how methodological advances in one application domain can inspire and spearhead advances in other applications. It is the power of cross-pollination of ideas between domains and individuals,” said Gina Tourassi, director of the NCCS.

UT-Battelle LLC manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.