2025 Best Practices for AI on Frontier
AI Training Series Part 3: 2025 Best Practices for AI on Frontier
March 28, 2025 | 1:00–2:30 PM EST
Location: Zoom
Spring 2025 AI Focused Frontier Hackathon & Frontier AI Training Series
Join us for a training session on best practices for running AI workloads on Frontier, covering key tools, strategies, and performance insights. We’ll start with an overview of the AI software stack and documentation, followed by practical guidance on optimizing AI workflows for the system.
Training Presenters
Alessandro Fanfarillo, Senior Technical Staff, AMD
Topics include:
- Scaling AI Training on Frontier – Training large machine learning models efficiently requires advanced parallelization and distribution strategies. We’ll demonstrate how to train a model across multiple GPUs and nodes, with a focus on model parallelism techniques and frameworks like DeepSpeed and PyTorch FSDP.
- Single-node performance optimization – We’ll break down GEMM (General Matrix Multiply) performance, explaining what to expect and sharing best practices for optimizing it.
- Improving I/O – Data movement can be a major bottleneck for AI training. We’ll discuss how to use the on-node NVMe drives on Frontier for better I/O performance.
This training is highly recommended for anyone preparing to run AI workloads on Frontier. Teams selected for the Spring 2025 AI Focused Frontier Hackathon in April are strongly encouraged to attend.
2025 Best Practices for AI on Frontier Agenda
Time | Topic | Presenter |
---|---|---|
1:00 p.m. (EST) | High-level overview of the AI software stack & documentation | ORNL |
1:10 p.m. (EST) | AI Training at Scale on Frontier | ORNL |
1:55 p.m. (EST) | Single Node Optimization | AMD |
2:10 p.m. (EST) | Other Best Practices | ORNL |
2:20 p.m. (EST) | Q&A Discussion | |
Training Presentation Materials
2025 AI Best Practices for Frontier Presentations
Presentation | Presenter |
---|---|
OLCF Training: 2025 Best Practices for AI on Frontier | Aristeidis Tsaris, Junqi Yin, Sajal Dash (ORNL) |
Distributed Training of LLMs on Frontier | Sajal Dash (ORNL) |
Overview of AI Stack on Frontier | Junqi Yin (ORNL) |
GEMM User Tuning for AI Workloads | Alessandro Fanfarillo (AMD) |
