March 2023 OLCF User Conference Call: Checkpointing Best Practices for Frontier
The OLCF hosts monthly User Conference Calls. These calls are your opportunity to speak with center personnel, get the latest updates, and express any concerns you may have. No registration is required for this event.
Monthly Topic: Checkpointing Best Practices for Frontier
Speaker: Scott Atchley
Abstract:
When running simulations on large-scale systems, a node failure or sudden crash of your compute job is inevitable. One of the challenges is therefore deciding how to write “checkpoint” data optimally so that you do not have to restart from scratch after such a failure. OLCF’s Scott Atchley will discuss checkpointing approaches in general and how they can be applied on Frontier. This is an “encore” version of the talk Scott gave at the Frontier Training Workshop in February.
Presentation: Slides | Recording
Remote Attendees: Zoom coordinates will be sent to OLCF users approximately one week before the event.