AI for Science at Scale – Part 2
Thursday, October 12, 2023

Training large deep learning models, including large language models, is resource-intensive and requires innovative parallelization and distribution strategies. In the earlier workshop, we demonstrated how to use data parallelism to train a deep learning model across multiple GPUs of the Summit supercomputer. Building on that session, we will show how to train a model on multiple GPUs across nodes of the Frontier supercomputer. The focus will be on model parallelism techniques and frameworks, such as DeepSpeed, FSDP, and Megatron.

Presenter: Dr. Sajal Dash, OLCF – Analytics & AI Methods at Scale

Series GitHub page:

Slides | Recording

1:00 p.m. – 1:10 p.m.   Recap Part I                  Sajal Dash
1:10 p.m. – 1:20 p.m.   Intro to DDP                  Sajal Dash
1:20 p.m. – 1:30 p.m.   Intro to FSDP                 Sajal Dash
1:30 p.m. – 1:45 p.m.   3D Parallelism                Sajal Dash
1:45 p.m. – 2:00 p.m.   Case study: Forge             Sajal Dash
2:00 p.m. – 2:20 p.m.   DDP Hands-on                  Sajal Dash
2:20 p.m. – 2:40 p.m.   FSDP Hands-on                 Sajal Dash
2:40 p.m. – 3:00 p.m.   Megatron-DeepSpeed Hands-on   Sajal Dash


1:00 p.m. – 3:00 p.m. (Eastern Time)


