AI Training Series: AI for Science at Scale – Part 2
Thursday, October 12, 2023
Training large deep learning models, including large language models, is resource-intensive and requires innovative parallelization and distribution strategies. In the earlier workshop, we demonstrated how to train a deep learning model in a distributed fashion across multiple GPUs of the Summit supercomputer using data parallelism. Building on this, we will show how to train a model on multiple GPUs across nodes of the Frontier supercomputer, focusing on model parallelism techniques and the frameworks that implement them, such as DeepSpeed, FSDP, and Megatron.
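To illustrate the data-parallel starting point from Part 1, here is a minimal sketch of PyTorch DistributedDataParallel (DDP). This is a simplified single-process example, not the workshop's actual hands-on code: the toy model, the `gloo` backend, and the hard-coded rendezvous address are all illustrative assumptions, and in practice a launcher such as `torchrun` (or a Slurm job script on Frontier) sets the rank and world-size environment variables.

```python
# Minimal DDP sketch (illustrative; not the workshop's hands-on code).
# Assumes a single process for demonstration -- torchrun normally sets
# MASTER_ADDR/MASTER_PORT, rank, and world_size for multi-GPU runs.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # rendezvous address (assumed)
    os.environ.setdefault("MASTER_PORT", "29500")      # rendezvous port (assumed)
    # "gloo" works on CPU; on Frontier GPUs one would use an accelerator backend.
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 1)      # toy model standing in for a real network
    ddp_model = DDP(model)             # wraps the model; gradients are all-reduced
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 8)              # each rank would see its own data shard
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                    # backward pass triggers gradient sync
    opt.step()

    dist.destroy_process_group()
    return model

if __name__ == "__main__":
    train_step()
```

Each rank holds a full copy of the model and a different slice of the data; DDP averages gradients across ranks during `backward()`, which is exactly the scheme that FSDP and 3D parallelism later relax by sharding the model itself.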
Presenter: Dr. Sajal Dash, OLCF – Analytics & AI Methods at Scale
Series GitHub page: https://github.com/olcf/ai-training-series
[tw-toggle title="Agenda"]
Time | Content | Speaker
---|---|---
1:00 p.m. – 1:10 p.m. | Recap Part I | Sajal Dash |
1:10 p.m. – 1:20 p.m. | Intro to DDP | Sajal Dash |
1:20 p.m. – 1:30 p.m. | Intro to FSDP | Sajal Dash |
1:30 p.m. – 1:45 p.m. | 3D Parallelism | Sajal Dash |
1:45 p.m. – 2:00 p.m. | Case-study: Forge | Sajal Dash |
2:00 p.m. – 2:20 p.m. | DDP Hands-on | Sajal Dash |
2:20 p.m. – 2:40 p.m. | FSDP Hands-on | Sajal Dash |
2:40 p.m. – 3:00 p.m. | Megatron-DeepSpeed Hands-on | Sajal Dash
[/tw-toggle]