Autocoding Cancer

An algorithm developed by the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) in partnership with the National Cancer Institute (NCI) is speeding up the classification, or coding, of cancer pathology reports. Early results from the Cancer Moonshot program show that the algorithm dramatically reduces the time it takes national cancer registries to process reports and, in turn, the time it takes for data to become available for public health monitoring and policymaking.

The new Case-Level Multi-Task Hierarchical Self Attention Network uses natural language processing (NLP) to autocode cancer pathology reports that are submitted to Surveillance, Epidemiology, and End Results (SEER) registries across the United States. The system is designed to classify reports by the data elements used for population-level cancer surveillance, including cancer type, histology, laterality, and behavior. The network’s development is supported through a collaboration between ORNL and NCI known as MOSSAIC (Modeling Outcomes using Surveillance data and Scalable Artificial Intelligence for Cancer).

Heidi Hanson, group leader for the Biostatistics and Multiscale Systems Group, co-leads research under MOSSAIC at ORNL.

“The partnership between DOE and [NCI] is critical because both entities bring skills and resources to the table that, when combined, allow for innovative solutions that bring us closer to near–real time population cancer surveillance. The compute and security resources we have at ORNL, paired with the domain knowledge and population-level data they have at NCI, result in a unique and valuable system for measuring population health,” said Heidi Hanson, group leader for the lab’s Biostatistics and Multiscale Systems Group, who co-leads the research with principal investigator Georgia Tourassi. Tourassi is director of the National Center for Computational Sciences and the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science user facility at ORNL.

Collaborating with Hanson are Noah Schaefferkoetter, API developer and data manager for MOSSAIC; research scientists Adam Spannaus, Mayanka Chandra Shekar, Zachary Fox, Alina Peluso, and Shamimul Hasan; statistician Kirsen Sullivan; postdoctoral research associates Shalini Priya and Jordan Miller; computational scientist John Gounley; doctoral student Christoph Metzner; and health data engineer Dakota Murdock. Significant algorithm development was done by Shang Gao, now a senior machine learning researcher at Casetext.

Processing data

The 19 SEER registries collect and publish cancer data gathered on every cancer case reported from 22 geographic areas in the United States. The data is a critical tool for tracking and understanding the nation’s cancer trends and for informing public health strategy and policy. According to Hanson, the work of sifting through these reports and classifying the data is traditionally done by human registrars. But because of the volume of reports, the public reporting of the data is usually about 2 years behind.

“For some things, maybe that isn’t a huge problem, but if there’s a big uptick in cancer, we want to know sooner than 2 years after the change in incidence occurs,” Hanson said. “So, we have an NLP-based algorithm that classifies each cancer report without a human having to look at it.”

NLP algorithms are a type of machine learning algorithm that processes human language. They are trained to recognize patterns and key words by analyzing many examples of the target data, which in this case are cancer pathology reports. At ORNL, the team trains and tests its algorithms using CITADEL and the Knowledge Discovery Infrastructure. These security frameworks allow researchers to use the OLCF’s high performance computing systems for projects that include sensitive data, such as protected health information. In general, the more data an algorithm is trained on, the more accurate and generalizable it becomes. That’s why having access to reports from across the country through SEER is so important.

The Case-Level Multi-Task Hierarchical Self Attention Network goes beyond analyzing a single report. Often, a single cancer case leads to multiple reports that are not automatically linked to one another. For example, a metastatic cancer case can have reports from several tissue samples that don’t reference the original diagnosis. Rather than classifying each report in isolation, the algorithm classifies cancer data at the case level by drawing on the full series of connected reports.

“Utilizing case-level information has allowed for a huge boost in the accuracy of the algorithm overall,” said Hanson.
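To make the idea of case-level classification concrete, the following is a minimal sketch, not the team's method. The actual network uses hierarchical self-attention over the reports in a case. This example swaps in a much simpler stand-in: it averages hypothetical per-report label probabilities across all reports that share a case. The function name `classify_cases`, the `case_id` keys, and the label names are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_cases(report_scores):
    """Assign one label per case by averaging per-report probabilities.

    report_scores: iterable of (case_id, {label: probability}) pairs.
    Both the case_id linkage and the score format are hypothetical.
    Returns a dict mapping each case_id to the label with the highest
    mean probability across that case's reports.
    """
    totals = defaultdict(lambda: defaultdict(float))  # case -> label -> summed probability
    counts = defaultdict(int)                         # case -> number of reports seen

    for case_id, scores in report_scores:
        counts[case_id] += 1
        for label, prob in scores.items():
            totals[case_id][label] += prob

    return {
        case_id: max(label_totals, key=lambda lbl: label_totals[lbl] / counts[case_id])
        for case_id, label_totals in totals.items()
    }


# Illustrative data only: two reports for case "c1", one report for case "c2".
example_reports = [
    ("c1", {"lung": 0.6, "breast": 0.4}),
    ("c1", {"lung": 0.3, "breast": 0.7}),
    ("c2", {"colon": 0.9, "lung": 0.1}),
]
case_labels = classify_cases(example_reports)
```

In this toy data, the first report for case "c1" leans toward one label on its own, but the second report tips the case-level decision the other way. That shows, in a simplified form, why pooling evidence across connected reports can change the result.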

Speeding up classification

The algorithm is already being used in 12 SEER registries and in 1 non-SEER registry and is processing thousands of reports per second. Working 18× faster than a human registrar, the algorithm saves 46,000 person hours per year. However, it has not fully replaced the experience and discernment of human registrars.

“Every report that’s read by the algorithm gets a confidence flag,” Hanson said. “Registrars review a subset of reports that meet the confidence threshold and conduct full reviews when reports do not meet the criteria.” Of the millions of reports screened each year, 17.5% are autocoded at the predefined confidence threshold of 97%. “It’s still helpful to get a suggestion from the computer about the classification when a report does not meet the threshold,” said Hanson.
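The routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not the registries' software. The 97% threshold comes from the article, but the `Prediction` record, the `route` and `autocode_rate` helpers, and the choice to treat a score exactly at the threshold as passing are all hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.97  # the 97% threshold quoted in the article


@dataclass
class Prediction:
    report_id: str     # hypothetical report identifier
    label: str         # hypothetical predicted code, such as a cancer site
    confidence: float  # model confidence in the range [0, 1]


def route(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split predictions into (autocoded, needs_full_review) lists.

    Assumption: a prediction at or above the threshold is autocoded.
    Everything else goes to a registrar with the model's suggestion attached.
    """
    autocoded, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            autocoded.append(p)
        else:
            needs_review.append(p)
    return autocoded, needs_review


def autocode_rate(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Fraction of predictions that meet the threshold (0.0 for empty input)."""
    if not predictions:
        return 0.0
    autocoded, _ = route(predictions, threshold)
    return len(autocoded) / len(predictions)


# Illustrative data only.
sample = [
    Prediction("r1", "breast", 0.99),
    Prediction("r2", "lung", 0.80),
    Prediction("r3", "colon", 0.97),
    Prediction("r4", "lung", 0.50),
]
auto, manual = route(sample)
```

Raising the threshold in a sketch like this sends more reports to registrars and lowers the autocoded share. Lowering it does the opposite, at the cost of accepting less certain classifications.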

As the algorithm is adopted by more registries, the impact on population-level reporting could be significant. The team is also working on privacy-preserving methods that would enable data- and model-centric approaches to train the algorithm in a distributed environment. These advances demonstrate the power of AI for improving population health surveillance.

“In time, we could see increases in cancer rates from an environmental exposure or from declines in diagnosis rates due to other public health crises, such as COVID-19, within a matter of months,” said Hanson. “The reduction in manual coding hours and the potential for near–real time cancer reporting are promising for the future of public health surveillance.”

UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.