AI Training Series: Enhancing PyTorch Performance on Frontier with the RCCL/OFI-plugin

Wednesday, April 17, 2024

The AWS-OFI-RCCL plugin lets applications based on AMD’s RCCL use libfabric as the network provider. The plugin can be built and then used by common ML/DL libraries such as PyTorch to improve performance on AMD GPUs.

This seminar will cover how to run PyTorch on Frontier in a distributed, multi-node setting using the aws-ofi-rccl plugin. Specifically, the talk will cover the basics of RCCL, an overview of the plugin, PyTorch use cases, and profiling examples. In addition to this hour-long seminar, our documentation will be updated with the instructions discussed at the event (how to create a PyTorch environment on Frontier that uses the plugin). The seminar is intended for OLCF users who have an allocation on Frontier, but all are welcome to join and view the presentation.

Speakers: Mengshiou Wu (HPE), Mark Stock (HPE)

Time | Topic | Speaker
1:00 pm – 1:30 pm | RCCL, RCCL Tester, and the Plugin | Mark Stock (HPE)
1:30 pm – 2:00 pm | PyTorch Examples and Profiling | Mengshiou Wu (HPE)

 

https://www.zoomgov.com/j/1604500192


Registration
Registration has closed.
Recording
(slides | recording)

Date: Apr 17 2024
Time: 1:00 pm - 2:00 pm (Eastern Time)
Location: Webcast
Organizer: Michael Sandoval
Email: sandovalma@ornl.gov