2025 Best Practices for AI on Frontier

AI Training Series Part 3: 2025 Best Practices for AI on Frontier

March 28, 2025 | 1:00–2:30 PM EST
Location: Zoom

Spring 2025 AI Focused Frontier Hackathon & Frontier AI Training Series

Join us for a training session on best practices for running AI workloads on Frontier, covering key tools, strategies, and performance insights. We’ll start with an overview of the AI software stack and documentation, followed by practical guidance on optimizing AI workflows for the system.

Training Presenters

Alessandro Fanfarillo, Senior Technical Staff, AMD

Topics include:

  • Scaling AI Training on Frontier – Training large machine learning models efficiently requires advanced parallelization and distribution strategies. We’ll demonstrate how to train a model across multiple GPUs and nodes, with a focus on model parallelism techniques and frameworks like DeepSpeed and PyTorch FSDP.
  • Single-node performance optimization – We’ll break down GEMM (General Matrix Multiply) performance, explaining what throughput to expect and which practices help you reach it.
  • Improving I/O – Data movement can be a major bottleneck for AI training. We’ll discuss how to use the on-node NVMe drives on Frontier for better I/O performance.
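
As a rough illustration of the kind of arithmetic behind the GEMM topic, the sketch below converts a measured GEMM time into achieved throughput. It is not taken from the training materials: the function names and the timing in the example are hypothetical, and it assumes the conventional count of 2·M·N·K floating-point operations for multiplying an M×K matrix by a K×N matrix.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A @ B, with A of shape (m, k) and B of shape (k, n)."""
    # Each of the m*n outputs is conventionally counted as k multiply-add pairs.
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s, given the measured wall-clock time of one GEMM."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical measurement (an assumption, not a Frontier result):
    # an 8192 x 8192 x 8192 GEMM timed at 10 ms.
    print(f"{achieved_tflops(8192, 8192, 8192, 0.010):.1f} TFLOP/s")
```

Comparing a number like this against the hardware's documented peak for the chosen data type gives a first sense of how much headroom a given matrix shape leaves.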

This training is highly recommended for anyone preparing to run AI workloads on Frontier. Teams selected for the Spring 2025 AI Focused Frontier Hackathon in April are strongly encouraged to attend.

2025 Best Practices for AI on Frontier Agenda

Time | Topic | Presenter
1:00 p.m. (EST) | High level overview of AI software stack & documentation | ORNL
1:10 p.m. (EST) | AI Training at Scale on Frontier | ORNL
1:55 p.m. (EST) | Single Node Optimization | AMD
2:10 p.m. (EST) | Other Best Practices | ORNL
2:20 p.m. (EST) | Q&A Discussion |

Training Presentation Materials

Training Recording

2025 AI Best Practices for Frontier Presentations

Presentation | Presenter
OLCF Training: 2025 Best Practices for AI on Frontier | Aristeidis Tsaris, Junqi Yin, Sajal Dash (ORNL)
Distributed Training of LLMs on Frontier | Sajal Dash (ORNL)
Overview of AI Stack on Frontier | Junqi Yin (ORNL)
GEMM User Tuning for AI Workloads | Alessandro Fanfarillo (AMD)

Organizer

Asim YarKhan