The world’s fastest supercomputer could help bring up-to-the-minute, pinpoint weather forecasts weeks in advance to your laptop, tablet or phone.
Researchers at the Department of Energy’s Oak Ridge National Laboratory used Frontier, the Oak Ridge Leadership Computing Facility’s 2-exaflop HPE Cray EX supercomputing system, to train the world’s largest artificial intelligence model for weather prediction. The study earned the team a finalist nomination for the Association for Computing Machinery Gordon Bell Prize for Climate Modeling, which honors innovations in applying high-performance computing to climate modeling applications.
The achievement could herald an era of fast, cheap and highly accurate weather forecasts, benefiting everyone from first responders and farmers to parents planning for the evening soccer game.
“These would be forecasts not just for the whole country or state, not just for the local region but for your or my address,” said Prasanna Balaprakash, ORNL’s director of AI programs. “Until now, it’s been very hard to do that, but AI is bringing these kinds of hyperlocal, hyperaccurate forecasts within reach.”
This year’s prize will be presented at the International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov. 17 to 22. Balaprakash and his team will share the results of their study Nov. 19 at the conference.
The team’s Oak Ridge Base Foundation Model for Earth System Predictability, or ORBIT, draws on 113 billion model parameters to predict weather up to 30 days in advance. Accuracy rates for short-term forecasts (up to about two weeks in advance) range as high as 90 to 95%; for long-term forecasts (more than two weeks in advance), 60 to 80%.
“It’s relatively easy to predict the weather for tomorrow or the next day with high accuracy if we have good data,” Balaprakash said. “We want to extend that accuracy rate by several weeks. This would be a game changer for forecasting and for emergency response. We could adapt and tune the ORBIT model at a government level to predict hurricanes, tornadoes and flooding such as the recent disasters in North Carolina after Hurricane Helene, which would help save countless lives. Consumers could use it at a personal level to tailor their plans for travel, gardening or a backyard birthday party.”
Popular AI models such as OpenAI’s ChatGPT rely on large language models, which learn to recognize and predict patterns such as words by using large datasets, often scraped from the web. ORBIT relies on a foundation model that learns by analyzing a broad range of weather data including cloud formations, humidity, temperature and geography. Most AI models are trained on NVIDIA-based GPU platforms; ORBIT relied on Frontier, which uses AMD technology.
“No one has ever trained a weather model this large,” Balaprakash said. “We’re the first. Language and text are a single dimension. For weather and climate, a single point in space and time equals one dimension of data, and we have the whole planet to cover and about a hundred years’ worth of data or more for each of these points. It’s much more difficult. That’s why we needed an exascale machine with the power of Frontier to train our AI model.”
Training ORBIT required cycling through petabytes of data across 6,144 of Frontier’s more than 9,400 nodes as the model generated billions of potential scenarios. The simulations ultimately reached speeds of 1.6 exaflops, or 1.6 quintillion calculations per second — the fastest ever for an AI foundation model based on weather or climate.
“Our innovative approach to training AI foundation models boosts scalability and efficiency, enabling ORBIT to train with record-breaking performance,” said Xiao Wang, an ORNL computer scientist and co-author of the study. “ORBIT is designed to be architecture-agnostic, allowing it to excel across various computing platforms without being tied to specific hardware.”
Once trained, ORBIT can be updated as necessary but won’t need a machine the size or speed of Frontier again. That means versions of the model could be tailored to individual machines, from government servers to consumer laptops or even smartphones.
“It’s a huge potential savings not just in terms of time but of energy,” Balaprakash said. “The model only has to be trained once. We don’t need these big training runs after that. We can take the fine-tuned model, run it on a much smaller machine and get results in milliseconds. It’s basically a push-button technology that allows anyone to predict the weather.
“That’s one of the major advantages of machine learning for weather and climate: to act as a surrogate for traditional weather and climate simulations on a huge, powerful but energy-hungry supercomputer. It’s not a replacement for modeling and simulation. It’s the next step.”
To control for errors, the ORBIT team is building measures for uncertainty quantification that require the model to assess confidence in its forecasts and establish trustworthiness.
“The model can tell us how certain its predictions are,” said Dan Lu, an ORNL computational scientist and co-author of the study. “It can indicate whether its forecast is made with 95% confidence, 75% confidence and so on. Using fast and computationally cheap AI surrogate models, we can run several instances to estimate uncertainty, and we hope to enhance these measures as we increase the model’s long-range accuracy.”
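The ensemble approach Lu describes — running several cheap surrogate forecasts and using their spread to quantify uncertainty — can be sketched in a few lines. This is a toy illustration, not ORBIT’s actual code: the `surrogate_forecast` function, the grid-point values and the perturbation scale are all hypothetical stand-ins for a real AI surrogate model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def surrogate_forecast(initial_state, perturbation_scale=0.05):
    """Stand-in for one fast AI surrogate run. A real surrogate would map
    the initial atmospheric state to a forecast field; here we simply
    perturb the input to mimic ensemble spread."""
    noise = rng.normal(0.0, perturbation_scale, size=initial_state.shape)
    return initial_state + noise

# Toy "initial state": temperatures (deg C) at a few grid points.
initial_state = np.array([21.0, 18.5, 25.2, 19.8])

# Run an ensemble of cheap surrogate forecasts.
ensemble = np.stack([surrogate_forecast(initial_state) for _ in range(50)])

mean_forecast = ensemble.mean(axis=0)   # central prediction
spread = ensemble.std(axis=0)           # ensemble spread = uncertainty estimate
lower, upper = np.percentile(ensemble, [2.5, 97.5], axis=0)  # ~95% interval

for i in range(initial_state.size):
    print(f"point {i}: {mean_forecast[i]:.1f} C +/- {spread[i]:.2f} "
          f"(95% interval {lower[i]:.1f}..{upper[i]:.1f})")
```

Because each surrogate run is so cheap compared with a traditional simulation, dozens or hundreds of ensemble members can be generated in moments, turning the spread of their predictions into the kind of confidence statement Lu describes.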
The team ultimately plans to fine-tune ORBIT to generate highly accurate forecasts months in advance. Agencies such as the Department of Defense and institutions such as the Tennessee Valley Authority have expressed interest in the model.
“We are not finished by any means,” Balaprakash said. “The Gordon Bell nomination is an honor, but it’s not an end point. It’s a starting point.”
Besides Balaprakash, Wang, and Lu, the ORBIT team included Siyan Liu, Aristeidis Tsaris, Jong-Youl Choi, Ming Fan, Wei Zhang, Junqi Yin, and Moetasim Ashfaq of ORNL and Ashwin M. Aji of AMD.
This research was supported by the ORNL AI Initiative and the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is a DOE Office of Science user facility at ORNL.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.