Staff Section Head - Kevin Thach

The Systems Section supports the NCCS’s computing, networking, and storage systems, including general support of critical computational and facilities-related infrastructure and systems. It administers and supports high-speed parallel file systems and archive capabilities, develops tools, and administers data management platforms to ensure security, operational, and laboratory policy compliance.

Section Groups

HPC Clusters

The HPC Clusters Group administers and supports the division’s large-scale cluster computing infrastructure, which includes system installation, deployment, acceptance, performance testing, upgrades, problem diagnosis, and troubleshooting.

HPC Cybersecurity & Information Engineering

The HPC Cybersecurity & Information Engineering Group develops tools and administers data management platforms to extract and analyze telemetry, event logs, and system state information to ensure security and laboratory policy compliance.

HPC Infrastructure & Networking

The HPC Infrastructure & Networking Group designs, implements, and operates all infrastructure systems and networking services that are common to the other systems groups within NCCS, such as HPC Scalable Systems and HPC Storage & Archive. In addition, the group provides the OLCF Slate Service. Slate, built on Kubernetes and the Red Hat OpenShift Container Platform, provides a container orchestration service for user-managed persistent applications that run alongside the OLCF supercomputer systems and other OLCF-managed HPC clusters.

HPC Infrastructure Operations

The HPC Infrastructure Operations Group provides continuous monitoring, issue triaging and escalation, and general support of critical computational and facilities-related infrastructure.

HPC Scalable Systems

The HPC Scalable Systems Group administers and supports system installation, deployment, acceptance, performance testing, upgrades, problem diagnosis, and troubleshooting of HPC computational resources.

HPC Storage & Archive

The HPC Storage & Archive Group is responsible for the high-performance scratch and archival storage systems across the various programs in NCCS. Our work covers the entire lifecycle of high-performance storage systems: requirements gathering, design, procurement, installation, acceptance, data lifecycle management, system upkeep, on-call support, and decommissioning. We collaborate outside ORNL with other DOE labs, vendors, and industrial partners to anticipate and solve future storage demands. We also collaborate within ORNL to help evaluate new storage platforms for production feasibility and to operate the TechInt testbed.