AI Training Series: AI for Science at Scale – Part 2

Thursday, October 12, 2023

Training large deep learning models, including large language models, is resource-intensive and requires innovative parallelization and distribution strategies. In the earlier workshop, we demonstrated how to train a deep learning model in a distributed fashion across multiple GPUs of the Summit supercomputer using data parallelism. Building on this, we will show how to train a model on multiple GPUs across nodes of the Frontier supercomputer, focusing on model-parallelism techniques and frameworks such as DeepSpeed, FSDP, and Megatron.
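As a point of reference for the data-parallel approach covered in Part 1, the sketch below wraps a model in PyTorch's DistributedDataParallel (DDP). It is a minimal illustration, not the workshop's hands-on material: it runs as a single CPU process with the `gloo` backend, whereas a real multi-GPU job on Frontier would be launched with `torchrun` (or `srun`) using the `nccl` backend, with rank and world size supplied by the launcher.

```python
# Minimal DDP sketch (assumptions: PyTorch is installed; a single-process
# CPU run with the "gloo" backend stands in for a torchrun multi-GPU launch).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    # torchrun normally sets these; default to a 1-process run for the sketch.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes

    model = torch.nn.Linear(8, 1)
    ddp_model = DDP(model)  # each rank holds a full replica of the model
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 8)  # each rank would see a different data shard
    y = torch.randn(4, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()  # gradients are all-reduced across ranks here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(f"loss: {train_step():.4f}")
```

Because every rank keeps a full copy of the model, DDP scales the batch but not the model size; the FSDP and 3D-parallelism sessions below address models too large for a single GPU's memory.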

Presenter: Dr. Sajal Dash, OLCF – Analytics & AI Methods at Scale

Series GitHub page:

Slides | Recording

1:00 p.m. – 1:10 p.m. | Recap of Part I             | Sajal Dash
1:10 p.m. – 1:20 p.m. | Intro to DDP                | Sajal Dash
1:20 p.m. – 1:30 p.m. | Intro to FSDP               | Sajal Dash
1:30 p.m. – 1:45 p.m. | 3D Parallelism              | Sajal Dash
1:45 p.m. – 2:00 p.m. | Case Study: Forge           | Sajal Dash
2:00 p.m. – 2:20 p.m. | DDP Hands-on                | Sajal Dash
2:20 p.m. – 2:40 p.m. | FSDP Hands-on               | Sajal Dash
2:40 p.m. – 3:00 p.m. | Megatron-DeepSpeed Hands-on | Sajal Dash


Oct 12, 2023, 1:00 p.m. – 3:00 p.m. (Eastern Time)


