Oak Ridge supercomputer will help identify producers of child pornography

Consumers of child pornography break the law when they download photos and videos from file-sharing networks. But police are more concerned with the porn producers uploading the files. Every new posting means a child is in harm’s way. To accelerate the acquisition of information needed to arrest child predators, law enforcement officers have teamed with data analytics experts at Oak Ridge National Laboratory (ORNL) for a project that will use Jaguar, one of the world’s fastest supercomputers, to speedily analyze the activities on file-sharing networks that pinpoint porn producers.

“A consumer can be put in jail, but we’d rather go after the producer because there’s a child we can rescue,” said ORNL principal investigator Robert Patton, who develops algorithms to search, count, and characterize data files so similar information can be clustered for faster processing. Patton works with Internet Crimes Against Children (ICAC) task forces from local, state, and federal law enforcement agencies to find new files entering file-sharing networks. “There is a likelihood new material was just generated,” Patton said. “We want to know who put it out there.”

Patton’s project uses clustering algorithms that speed the search for a needle in a haystack of data. Clustering algorithms are employed in diverse applications, from finding flaws in the electronic controls of nuclear power plants to tracking suspected terrorists.

Patton has received funding from an industry sponsor, which applies the algorithm for cybersecurity, and an allocation of 1 million processor hours on Jaguar over a 1-year period spanning 2010 and 2011. Jaguar, a Department of Energy Office of Science supercomputer housed at the Oak Ridge Leadership Computing Facility, is mainly used by scientists and engineers requiring maximal computing power to solve large scientific simulation problems in the shortest times possible. Patton has used part of the allocation for initial runs to test some clustering algorithms and later will use the remainder for further development and testing using data provided by law enforcement.

With postdoctoral fellow Carlos Rojas, Patton is developing a clustering algorithm that will scale to use as many of Jaguar’s 224,256 processors as possible. Just as several workers can move a pile of dirt faster than one laborer working alone can, processors working in parallel at the algorithm’s instruction can speed analysis of millions of files from hard drives obtained through consent or confiscation.

Different subcategories of child pornography exist, and sentencing depends on the amount and the type of pornography on a computer. Today law enforcement officers must go through every file to determine its type and then count how many files of each type exist. Supercomputers can speed that analysis.
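The divide-and-tally idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record format (an opaque item ID paired with a category label), the names `count_types` and `parallel_tally`, and the use of a thread pool as a stand-in for processors. It is not the ORNL algorithm, and Python threads show only the split-and-merge pattern, not real supercomputer-scale speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_types(chunk):
    """Tally the category labels in one chunk of (item_id, label) records."""
    return Counter(label for _, label in chunk)

def parallel_tally(records, workers=4):
    """Split records into roughly equal chunks, count each chunk in a worker,
    then merge the partial counts into one total."""
    size = max(1, -(-len(records) // workers))  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_types, chunks):
            total.update(partial)
    return total

# Hypothetical de-identified records: opaque IDs with made-up labels.
records = [(f"item-{i}", "type-a" if i % 3 else "type-b") for i in range(10)]
tally = parallel_tally(records)
```

Because each chunk is counted independently, the same pattern extends to more workers: only the final merge step needs to see every partial result.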

To track a predator

Sifting through a mountain of data is daunting. File sizes may range from mere kilobytes for a thumbnail image to megabytes for a picture and gigabytes for a movie. “Just one predator may have a few hundred thousand picture or movie files,” Patton said. “They have captured guys who have up to 10 terabytes worth of material.”

Before law enforcement agents provide the researchers with data from the hard drives of suspects’ computers, each of which may hold 10,000 or 100,000 files, officers reduce the data set to a size manageable on a desktop computer and encode, or “de-identify,” the data. The researchers see no photos or videos. They work with only characteristics of the data and develop products from there.

For example, in 2009 Thomas Potok’s Applied Software Engineering Research group and Shaun Gleason’s Image Science and Machine Vision group at ORNL developed software called Artemis (named for the mythical Greek huntress) for finding images with a high percentage of skin tones. “The police put the Artemis CD-ROM into a computer of the person they’re investigating, and it will scan the hard drive and come back and say, here’s material that’s questionable,” Patton explained. It can take ICAC teams weeks or months to analyze files that Artemis can fly through in mere minutes. Use of Artemis by the Knoxville Police Department has led to approximately 30 arrests this year, according to Patton.

Rojas and Patton also analyze Internet traffic and trace the connections by which material is shared on file-sharing networks or websites. “We’ll get IP [Internet Protocol] addresses, certain session IDs, and then we’ll create a map,” Patton explained. “We’re trying to get a map to say, this IP address requested a search of this particular term, which is considered child pornography related, and these IP addresses responded to that, and the person behind that IP address then proceeded to download these files.”
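The kind of map Patton describes can be pictured as a small graph linking addresses, search terms, and files. The sketch below is only an illustration under stated assumptions: the four-field record layout, the function names `build_session_map` and `files_offered_by`, and the sample values are all hypothetical, and the addresses come from ranges reserved for documentation. The real log format and the researchers' tooling are not described in this article.

```python
from collections import defaultdict

def build_session_map(records):
    """Build a graph from (requester_ip, term, responder_ip, file_id) records.

    Nodes are (kind, value) pairs; edges run requester -> term -> responder -> file.
    """
    graph = defaultdict(set)
    for requester, term, responder, file_id in records:
        graph[("ip", requester)].add(("term", term))
        graph[("term", term)].add(("ip", responder))
        graph[("ip", responder)].add(("file", file_id))
    return graph

def files_offered_by(graph, ip):
    """Return the sorted file IDs linked to one responding address."""
    neighbors = graph.get(("ip", ip), set())
    return sorted(value for kind, value in neighbors if kind == "file")

# Hypothetical log records with placeholder terms and file IDs.
records = [
    ("192.0.2.10", "term-a", "198.51.100.7", "file-001"),
    ("192.0.2.10", "term-a", "198.51.100.8", "file-002"),
    ("192.0.2.11", "term-b", "198.51.100.7", "file-003"),
]
graph = build_session_map(records)
offered = files_offered_by(graph, "198.51.100.7")
```

Storing neighbors in sets means a repeated log line adds no duplicate edge, so the map stays compact even when the same exchange is recorded many times.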

“Across the globe criminals are using technology to facilitate the sexual exploitation of children,” said Grier Weeks, executive director of the National Association to Protect Children, or PROTECT. “Police are overwhelmed and outnumbered. These Oak Ridge scientists are the good guys we’ve been waiting for. Their computers will become child rescue engines.”

Taking a byte out of crime

The Internet is a sea of text that clustering algorithms can navigate faster than searches using key words. If you use Google to search for a key word, the search engine provides a list of links, many of which are similar to each other. “You may go to the first one and then when you get to the fifth one you may say, contentwise, it’s very similar to the first one. But you don’t know that until you click through all of them,” Patton explained. “Clustering would go through that set of results and say, links 1 and 5 are very similar; they cluster together. Link 2 and link 8 go together. And so forth.”
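Patton's search-result example can be sketched in a few lines. This is a simplified illustration, not the project's algorithm: it assumes word overlap (Jaccard similarity) as the measure of "similar," a threshold of 0.5, and a greedy one-pass grouping. The function names and sample snippets are invented for the example.

```python
def tokens(text):
    """Represent a snippet as the set of its lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two snippets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_results(snippets, threshold=0.5):
    """Greedy one-pass clustering: each snippet joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (representative token set, member indices)
    for index, snippet in enumerate(snippets):
        words = tokens(snippet)
        for representative, members in clusters:
            if jaccard(representative, words) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((words, [index]))
    return [members for _, members in clusters]

snippets = [
    "jaguar supercomputer oak ridge speed",
    "recipe for oak smoked salmon",
    "big cat jaguar habitat rainforest",
    "oak smoked salmon recipe ideas",
    "oak ridge jaguar supercomputer speed record",
]
groups = cluster_results(snippets)  # [[0, 4], [1, 3], [2]]
```

Counting from one, the first and fifth results land in the same group, which is the situation Patton describes: a reader would learn they overlap without having to click through both.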

File-sharing logs from law enforcement. The blue dot represents the file-sharing session. Red dots are computers offering files, and orange dots are child pornography files offered. Green dots indicate other potential crimes. Image credit: Analysis done by Tom Potok.


As data analyses scale up to deal with millions of documents, algorithms can find the most efficient way to group similar items. Flat clustering on graphics cards can group a million documents in 12 minutes, Patton said, whereas hierarchical clustering on a desktop computer, which prioritizes groups for further analysis, takes three times as long. But in a complex analysis, prioritization can save time. Data clusters can be distributed on the hundreds of thousands of processors of a leadership-class supercomputer for analysis. “What we’re trying to do is come up with an efficient way of clustering data that is distributed all over the place,” Patton said.
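The difference between flat and hierarchical clustering can be seen on a toy data set. The sketch below assumes k-means as the flat method and single-linkage agglomerative merging as the hierarchical one; the article does not say which specific algorithms Patton's team uses, and the function names and one-dimensional sample values are purely illustrative.

```python
def kmeans_1d(values, centers, iterations=10):
    """Flat clustering: assign each value to its nearest center, then move
    each center to the mean of its group, for a fixed number of rounds."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda c: abs(value - centers[c]))
            groups[nearest].append(value)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups

def agglomerative_1d(values):
    """Hierarchical clustering: repeatedly merge the two closest clusters
    (single linkage) and record the distance and result of every merge."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        gap, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append((gap, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
flat = kmeans_1d(values, centers=[0.0, 5.0])
tree = agglomerative_1d(values)
```

The flat method returns only the final groups, while the hierarchical one also returns the order and distance of each merge. That extra record is what allows groups to be ranked for further analysis, and the pairwise comparisons needed to produce it are one reason the hierarchical approach costs more time.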

Running a scalable clustering algorithm on Jaguar would prove that high-performance computing (HPC) can speed identification of child predators. Then local, state, and national law enforcement entities could decide whether investment in software, computing clusters, or even supercomputers would make sense for their jurisdictions. If the investment were justified, organizations such as PROTECT could lobby Congress for resources to fight child pornography. Just as the Automated Fingerprint Identification System helps the FBI spot suspects, a computational capability to quickly compare the contents of child porn repositories with new content on the Internet or to speed through hard drives confiscated from suspects could help police identify people who may be harming children. Patton said today’s inadequate sharing of data among local, state, and federal agencies makes it difficult to identify producers of child porn. “You’d be able to put together the data from all these different groups that are right now their own stovepipes,” said Patton.

“[Running clustering algorithms on HPC systems for data analysis] would bring a deeper understanding of the characteristics of the data,” Patton said. “Ultimately what we want to get to is the knowledge that’s contained in that data. How many times have we gone to Google and done a search, and we’re looking for some piece of information, and when we find it we get to the point of going, ‘I didn’t know that. Now I know that. Now I can take some action.’ It’s that process we want to speed up.”

By Dawn Levy