2025 Best Practices for AI on Frontier

AI Training Series Part 3: 2025 Best Practices for AI on Frontier

March 28, 2025 | 1:00–2:30 PM ET
Location: Zoom

Spring 2025 AI Focused Frontier Hackathon & Frontier AI Training Series

Join us for a training session on best practices for running AI workloads on Frontier, covering key tools, strategies, and performance insights. We’ll start with an overview of the AI software stack and documentation, followed by practical guidance on optimizing AI workflows for the system.

Training Presenters

Alessandro Fanfarillo, Senior Technical Staff, AMD

Topics include:

Scaling AI Training on Frontier – Training large machine learning models efficiently requires advanced parallelization and distribution strategies. We’ll demonstrate how to train a model across multiple GPUs and nodes, with a focus on model parallelism techniques and frameworks like DeepSpeed and PyTorch FSDP.
Single-node performance optimization – We’ll break down GEMM (General Matrix Multiply) performance, explaining what to expect and best practices for improving it.
Improving I/O – Data movement can be a major bottleneck for AI training. We’ll discuss how to use the on-node NVMe drives on Frontier for better I/O performance.

This training is highly recommended for anyone preparing to run AI workloads on Frontier. Teams selected for the Spring 2025 AI Focused Frontier Hackathon in April are strongly encouraged to attend.