Titan Overview
System description: The Oak Ridge Leadership Computing Facility (OLCF) has completed the first phase of an upgrade of the Jaguar system that will result in Titan, a hybrid-architecture Cray XK6 system with a peak theoretical performance of more than 20 petaflops. In the first phase, completed in February 2012, each of Jaguar’s 18,688 Cray XT5 compute nodes was replaced with a new Cray XK6 compute node consisting of one 16-core AMD Opteron 6274 processor running at 2.2 GHz and 32 gigabytes of DDR3 memory. The nodes are connected by Cray’s new high-performance Gemini network, which provides higher bandwidth, lower latency, faster collectives, and greater reliability than the interconnect of the previous-generation XT5 nodes. The upgraded Jaguar system has a total of 299,008 AMD Opteron CPU cores and 600 terabytes of memory, and is connected to the 240 GB/s Spider file system. Phase I of the upgrade also populated 960 of the XK6 nodes with NVIDIA Fermi GPUs.
The second phase of the upgrade will begin in the fall of 2012, when the 960 Fermi accelerators will be removed and 14,592 of Titan’s nodes (78 percent of the compute nodes) will be upgraded by adding NVIDIA “Kepler” GPU application accelerators, each with 6 gigabytes of high-speed, directly attached memory. We expect the Kepler-accelerated nodes to be available to users in early CY2013, well in advance of the plan-of-record commitment to place them into production use by January 1, 2014. Under this plan of record, the remaining 4,096 XK6 compute nodes will stay without an accelerator and will be available to users throughout CY2013.
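The node, core, and memory figures above are internally consistent, which a few lines of arithmetic confirm. The Python sketch below takes its values from the system description; the constant names are illustrative only. Memory is computed in decimal terabytes, which is presumably how the rounded 600 TB figure was obtained.

```python
# Values quoted in the system description (names are illustrative).
NODES = 18_688          # Cray XK6 compute nodes after Phase I
CORES_PER_NODE = 16     # one 16-core AMD Opteron 6274 per node
MEM_GB_PER_NODE = 32    # DDR3 memory per node
KEPLER_NODES = 14_592   # nodes that receive a Kepler GPU in Phase II

total_cores = NODES * CORES_PER_NODE                # 299,008 Opteron cores
total_mem_tb = NODES * MEM_GB_PER_NODE / 1000       # decimal TB, about 598
accelerated_pct = 100 * KEPLER_NODES / NODES        # about 78 percent
unaccelerated = NODES - KEPLER_NODES                # 4,096 CPU-only nodes

print(f"CPU cores:          {total_cores:,}")
print(f"Memory:             {total_mem_tb:,.0f} TB")
print(f"Accelerated nodes:  {accelerated_pct:.0f}%")
print(f"Nodes without GPU:  {unaccelerated:,}")
```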
Users of Titan will continue to have access to the Spider file system, with 240 GB/s of data bandwidth and over 10 PB of storage capacity; the OLCF will upgrade Spider in 2013 to increase both bandwidth and capacity. Users will also have access to the HPSS data archive, the LENS data analysis and visualization cluster, and the newly upgraded EVEREST high-resolution visualization facility. All of these resources are reachable through high-performance networks, including ESnet’s recently upgraded 100-gigabit-per-second links.
Allocable hours and availability for INCITE:
For proposal planning purposes, and assuming that acceptance of the accelerated nodes occurs in March 2013, Titan would provide up to 2 billion core-hours to users in CY2013. Under this assumption, we anticipate that the average INCITE project allocation could grow to between 50 million and 100 million Titan core-hours.
The allocation and charge unit, the Titan core-hour, will be based on the accelerated XK6 nodes. For this purpose, each XK6 node is defined as having 30 total cores (i.e., 16 CPU cores + 14 GPU core equivalents). Note that codes that do not use the GPUs will have only 16 CPU cores available per node; however, allocation requests, and the units charged, will still be based on 30 cores per node.
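A minimal sketch of the charging rule in Python, assuming a job is charged for every node it occupies for its full wall-clock time. The function and constant names are hypothetical and do not correspond to any OLCF tool. The last lines convert the anticipated 50 to 100 million core-hour allocations into node-hours.

```python
CHARGED_CORES_PER_NODE = 30   # 16 CPU cores + 14 GPU core equivalents
CPU_CORES_PER_NODE = 16       # cores usable by a CPU-only code


def titan_core_hours(nodes: int, wall_hours: float) -> float:
    """Titan core-hours charged for a job (assumed: nodes x 30 x hours),
    whether or not the code uses the GPUs."""
    return nodes * CHARGED_CORES_PER_NODE * wall_hours


def node_hours(titan_hours: float) -> float:
    """Convert a Titan core-hour budget back into node-hours."""
    return titan_hours / CHARGED_CORES_PER_NODE


# A hypothetical 1,000-node, 12-hour run costs the same with or without GPUs.
job = titan_core_hours(nodes=1_000, wall_hours=12)
print(f"{job:,.0f} Titan core-hours")   # 360,000 Titan core-hours

# A CPU-only code uses 16 of the 30 charged cores on each node.
print(f"CPU-only share of charge: {CPU_CORES_PER_NODE / CHARGED_CORES_PER_NODE:.0%}")

# The anticipated average allocations expressed as node-hours.
for alloc in (50e6, 100e6):
    print(f"{alloc:,.0f} core-hours = {node_hours(alloc):,.0f} node-hours")
```

On this accounting, a 50 million core-hour allocation corresponds to roughly 1.67 million node-hours, regardless of how many of those nodes actually exercise their GPUs.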
The allocation request must be stated in these Titan core-hours. The proposal narrative should nonetheless contain detailed, separate estimates of CPU core and GPU core requirements, each tied to the overall milestones and goals of the computational campaign.
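One way to organize the separate CPU and GPU estimates is to list each campaign phase with its node count, run length, number of runs, and whether it uses the GPUs, and then derive both the Titan core-hour request and the underlying CPU core-hours and GPU core-equivalent-hours. The Python sketch below does this; the `Phase` class, its fields, and the two example phases are hypothetical illustrations, not an INCITE template.

```python
from dataclasses import dataclass

CPU_CORES_PER_NODE = 16       # Opteron cores per XK6 node
GPU_CORE_EQUIV_PER_NODE = 14  # GPU core equivalents per node (charging definition)
CHARGED_CORES_PER_NODE = CPU_CORES_PER_NODE + GPU_CORE_EQUIV_PER_NODE  # 30


@dataclass
class Phase:
    """One hypothetical stage of a computational campaign."""
    milestone: str
    nodes: int
    wall_hours: float
    runs: int
    uses_gpu: bool

    @property
    def node_hours(self) -> float:
        return self.nodes * self.wall_hours * self.runs


def summarize(phases):
    """Return the Titan core-hour request plus CPU and GPU breakdowns."""
    total_nh = sum(p.node_hours for p in phases)
    gpu_nh = sum(p.node_hours for p in phases if p.uses_gpu)
    return {
        # Every occupied node is charged 30 cores, GPU used or not.
        "titan_core_hours": total_nh * CHARGED_CORES_PER_NODE,
        # Counts all 16 CPU cores on every occupied node.
        "cpu_core_hours": total_nh * CPU_CORES_PER_NODE,
        # Counts GPU core equivalents only in phases that use the GPU.
        "gpu_core_equivalent_hours": gpu_nh * GPU_CORE_EQUIV_PER_NODE,
    }


phases = [
    Phase("M1: CPU-only code verification", nodes=500, wall_hours=6, runs=20, uses_gpu=False),
    Phase("M2: GPU-accelerated production", nodes=8_000, wall_hours=12, runs=25, uses_gpu=True),
]
summary = summarize(phases)
for key, value in summary.items():
    print(f"{key}: {value:,.0f}")
```

In this example the request comes to 73,800,000 Titan core-hours, while the CPU core-hours and GPU core-equivalent-hours sum to 840,000 less, because the CPU-only phase is still charged for the 14 GPU core equivalents on every node it occupies.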
The OLCF expects CPU resources to be available for all of CY2013, and GPU resources to be available as soon as acceptance is completed, probably in the first half of the year. As on Jaguar today, jobs on Titan will be scheduled in whole-node increments. Contact the OLCF with any questions about estimating the Titan core-hours needed for a proposal.
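Because jobs are scheduled in whole nodes, a job whose core count is not a multiple of 16 is also charged for its partly idle last node. The Python sketch below assumes one MPI rank per CPU core by default and a charge of 30 Titan core-hours per node-hour; `nodes_for` and `charge` are hypothetical helpers for illustration.

```python
import math

CPU_CORES_PER_NODE = 16      # Opteron cores per XK6 node
CHARGED_CORES_PER_NODE = 30  # Titan core-hours charged per node-hour


def nodes_for(mpi_ranks: int, ranks_per_node: int = CPU_CORES_PER_NODE) -> int:
    """Nodes a job occupies: scheduling is in whole nodes, so round up."""
    return math.ceil(mpi_ranks / ranks_per_node)


def charge(mpi_ranks: int, wall_hours: float,
           ranks_per_node: int = CPU_CORES_PER_NODE) -> float:
    """Titan core-hours charged, counting every occupied node in full."""
    return nodes_for(mpi_ranks, ranks_per_node) * CHARGED_CORES_PER_NODE * wall_hours


# 100 ranks at 16 per node need 7 nodes; the 7th node is only partly used.
print(nodes_for(100))   # 7
print(charge(100, 10))  # 2100

# Hypothetical layout with one rank per node, e.g. one rank driving each GPU.
print(nodes_for(100, ranks_per_node=1))  # 100
```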
The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of leadership-class systems. Nearly 5 billion core-hours will be allocated for calendar year 2013 on the 10-petaflops IBM Blue Gene/Q “Mira” and the 20-petaflops Cray XK6 “Titan.” See http://hpc.science.doe.gov for details of the call for proposals.