AI Training Series: AI for Science at Scale – Part 2
Thursday, October 12, 2023
Training large deep learning models, including large language models (LLMs), is resource-intensive and requires innovative parallelization and distribution strategies. In the earlier workshop, we demonstrated how to train a deep learning model in a distributed fashion across multiple GPUs of the Summit supercomputer using data parallelism. Building on that, this workshop shows how to train a model on multiple GPUs across nodes of the Frontier supercomputer, focusing on model parallelism techniques and frameworks such as DeepSpeed, FSDP, and Megatron.
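As a point of reference for the data-parallel approach covered in the earlier workshop, here is a minimal sketch using PyTorch's `DistributedDataParallel`. It runs as a single CPU process with the `gloo` backend purely for illustration; the model, dimensions, and environment settings are assumptions, not material from the workshop itself. On a real system, each rank would run on its own GPU and gradients would be all-reduced across ranks.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration; a launcher (e.g. torchrun or srun)
# would normally set these for each rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Illustrative toy model; each rank holds a full replica under data parallelism.
model = torch.nn.Linear(16, 4)
ddp_model = DDP(model)

# Each rank processes its own shard of the batch; after backward(),
# DDP all-reduces gradients so every replica stays synchronized.
out = ddp_model(torch.randn(8, 16))
loss = out.sum()
loss.backward()

dist.destroy_process_group()
```

Model parallelism, the focus of this session, instead splits the model itself across devices; frameworks such as DeepSpeed, FSDP, and Megatron automate that sharding.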
Series Github page: https://github.com/olcf/ai-training-series
Slides:
Recording:
Agenda:
| Time | Topic | Speaker |
|---|---|---|
| 1:00 pm – 1:45 pm EDT | Scaling, LLMs | Sajal Dash (OLCF, Analytics & AI Methods at Scale) |
| 1:45 pm – 2:00 pm EDT | Scientific Applications | Sajal Dash |
| 2:00 pm – 3:00 pm EDT | Hands-on Examples | Sajal Dash |
The 100-registration limit has been reached; registration is now closed.
Joining information will be sent to you in a calendar event before the event.
