Note: For GPU-specific compilation details, please see the Accelerated Computing Guide.

Compiling code on Titan (and other Cray machines) is different from compiling code for commodity or Beowulf-style HPC Linux clusters. Among the most prominent differences:

  • Cray provides a sophisticated set of compiler wrappers to ensure that the compile environment is set up correctly. Their use is highly encouraged.
  • In general, linking/using shared object libraries on compute partitions is not supported.
  • Cray systems include many different types of nodes, so some compiles are, in fact, cross-compiles.

Available Compilers

The following compilers are available on Titan:

  • PGI, the Portland Group Compiler Suite (default)
  • GCC, the GNU Compiler Collection
  • CCE, the Cray Compiling Environment
  • Intel, Intel Composer XE

Upon login, the default versions of the PGI compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of PGI and MPI.


Cray Compiler Wrappers

Cray provides a number of compiler wrappers that substitute for the traditional compiler invocation commands. The wrappers call the appropriate compiler, add the appropriate header files, and link against the appropriate libraries based on the currently loaded programming environment module. To build codes for the compute nodes, you should invoke the Cray wrappers via:

Wrapper  Purpose
cc       To use the C compiler
CC       To use the C++ compiler
ftn      To use the Fortran compiler

The -craype-verbose option can be used to view the compile line built by the compiler wrapper:

titan-ext$ cc -craype-verbose ./a.out
pgcc -tp=bulldozer -Bstatic ...

Compiling and Node Types

Titan is composed of several different types of nodes:

  • Login nodes running traditional Linux
  • Service nodes running traditional Linux
  • Compute nodes running the Compute Node Linux (CNL) microkernel

The type of work you are performing will dictate the type of node for which you build your code.

Compiling for Compute Nodes (Cross Compilation)

Titan compute nodes are the nodes that carry out the vast majority of computation on the system. Compute nodes run the CNL microkernel, which is markedly different from the OS running on the login and service nodes. Most code that runs on Titan will be built targeting the compute nodes.

All parallel codes should run on the compute nodes. Compute nodes are accessible only by invoking aprun within a batch job. To build codes for the compute nodes, you should use the Cray compiler wrappers:

titan-ext$ cc code.c
titan-ext$ CC code.cc
titan-ext$ ftn code.f90
Note: The OLCF highly recommends that the Cray-provided cc, CC, and ftn compiler wrappers be used when compiling and linking source code for use on Titan compute nodes.

Support for Shared Object Libraries

On Titan, and Cray machines in general, statically linked executables will perform better and are easier to launch. Depending on the module files you load, certain Cray-provided modules and libraries (such as mpich2 and cudart) may employ and configure dynamic linking automatically; the following warnings do not apply to them.

Warning: In general, the use of shared object libraries is strongly discouraged on Cray compute nodes. This excludes certain Cray-provided modules and libraries such as mpich2 and cudart. Any necessary dynamic linking of these libraries will be configured automatically by the Cray compiler wrappers.

If you must use shared object libraries, you will need to copy all necessary libraries to a Lustre scratch area ($MEMBERWORK or $PROJWORK) or the NFS /ccs/proj area and then update your LD_LIBRARY_PATH environment variable to include this directory. Due to the metadata overhead on Lustre, /ccs/proj is the suggested area for shared objects and Python modules. For example, the following command will append your project's NFS home area to the LD_LIBRARY_PATH in bash:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ccs/proj/[projid]

Compiling with shared libraries can be further complicated by the fact that Titan's login nodes do not run the same operating system as the compute nodes; many shared libraries that are available on the login nodes are not available on the compute nodes. This means that an executable may appear to compile correctly on a login node but fail to start on a compute node because it cannot locate the shared libraries it needs.

It may appear that this could be resolved by locating the shared libraries on the login node and copying them to Lustre or /ccs/proj for use on the compute nodes. This is inadvisable because these shared libraries were not compiled for the compute nodes, and may perform erratically. Also, referring directly to these libraries circumvents the module system, and may jeopardize your deployment environment in the event of system upgrades.

For performance considerations, it is important to bear in mind that each node in your job will need to search through $LD_LIBRARY_PATH for each missing dynamic library, which could cause a bottleneck with the Lustre Metadata Server. Lastly, calls to functions in dynamic libraries will not benefit from compiler optimizations that are available when using static libraries.

Compiling for Login or Service Nodes

When you log into Titan you are placed on a login node. When you submit a job for execution, your job script is initially launched on one of a small number of shared service nodes. All tasks not launched through aprun will run on the service node. Users should note that there are a small number of these login and service nodes, and they are shared by all users. Because of this, long-running or memory-intensive work should not be performed on login or service nodes.

Warning: Long-running or memory-intensive codes should not be compiled for use on login or service nodes.

When using cc, CC, or ftn your code will be built for the compute nodes by default. If you wish to build code for the Titan login nodes or service nodes, you must do one of the following:

  1. Add the -target-cpu=mc8 flag to your cc, CC, or ftn command
  2. Use the craype-mc8 module: module swap craype-interlagos craype-mc8
  3. Call the underlying compilers directly (e.g. pgf90, ifort, gcc)
  4. Use the craype-network-none module to remove the network and MPI libraries: module swap craype-network-gemini craype-network-none

XK7 Service/Compute Node Incompatibilities

On the Cray XK7 architecture, service nodes differ greatly from the compute nodes. This difference may cause cross-compiling issues that did not exist on Cray XT5 and earlier systems.

For the XK7, login and service nodes use AMD's Istanbul-based processor, while compute nodes use the newer Interlagos-based processor. Interlagos-based processors include instructions not found on Istanbul-based processors, so executables compiled for the compute nodes will not run on the login or service nodes; they typically crash with an illegal instruction error. Additionally, codes compiled specifically for the login or service nodes will not run optimally on the compute nodes.

Warning: Executables compiled for the XK7 compute nodes will not run on the XK7 login or service nodes.

Optimization Target Warning

Because of the difference between the login/service nodes (on which code is built) and the compute nodes (on which code is run), a software package's build process may inject optimization flags that incorrectly target the login/service nodes. Users are strongly urged to check makefiles for CPU-specific optimization flags (e.g., -tp, -hcpu, -march). Users should not need to set such flags; the Cray compiler wrappers will automatically add CPU-specific flags to the build. Choosing the incorrect processor optimization target can negatively impact code performance.


Changing Compilers

If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.

The following programming environment modules are available on Titan:

  • PrgEnv-pgi (default)
  • PrgEnv-gnu
  • PrgEnv-cray
  • PrgEnv-intel

To change from the default PGI environment to the default GCC environment, use:

$ module unload PrgEnv-pgi 
$ module load PrgEnv-gnu

Or alternatively:

$ module swap PrgEnv-pgi PrgEnv-gnu

Changing Versions of the Same Compiler

To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:

$ module swap PrgEnv-pgi PrgEnv-gnu
$ module swap gcc gcc/4.6.1
Note: We recommend the following general guidelines for using the programming environment modules:

  • Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
  • Do not swap or unload any of the Cray-provided modules (those with names like xt-*, xe-*, xk-*, or cray-*).
  • Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.

Compiling Threaded Codes

When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.

OpenMP
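
The compile examples below build a simple threaded program, test.c. The following is only a minimal, hypothetical sketch of what such a file might contain; the file name and contents are illustrative, not OLCF-supplied code:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread prints its ID; the thread count comes from OMP_NUM_THREADS */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}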

For PGI, add “-mp” to the build line.

$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x

For GNU, add “-fopenmp” to the build line.

$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x

For Cray, no additional flags are required.

$ module swap PrgEnv-pgi PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x

For Intel, add "-qopenmp" to the build line.

$ module swap PrgEnv-pgi PrgEnv-intel
$ cc -qopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x

Running with OpenMP and PrgEnv-intel

An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.

How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, depth is the number of threads per MPI task, controlled by the -d option to aprun, and npes is the number of MPI tasks (processing elements) per socket, controlled by the -n option to aprun. Replace depth and npes with the values you plan to use.

When depth divides evenly into the number of processing elements per socket (npes):

$ export OMP_NUM_THREADS=depth    # or any value less than or equal to depth
$ aprun -n npes -d depth -cc numa_node a.out

When depth does not divide evenly into the number of processing elements per socket (npes):

$ export OMP_NUM_THREADS=depth    # or any value less than or equal to depth
$ aprun -n npes -d depth -cc none a.out

SHMEM

For SHMEM codes, users must load the cray-shmem module before compiling:

$ module load cray-shmem
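
As an illustration only, a minimal OpenSHMEM-style program might look like the sketch below; the exact initialization and query calls depend on the SHMEM version provided by the cray-shmem module:

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();                  /* older SHMEM versions use start_pes(0) instead */
    int me   = shmem_my_pe();      /* rank of this processing element (PE) */
    int npes = shmem_n_pes();      /* total number of PEs in the job */
    printf("Hello from PE %d of %d\n", me, npes);
    shmem_finalize();              /* not present in older SHMEM versions */
    return 0;
}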

Accelerator Compiler Directives

Accelerator compiler directives allow the compiler, guided by the programmer, to take care of low-level accelerator work. One of the main benefits of a directives-based approach is an easier and faster transition of existing code compared to low-level GPU languages. Additional benefits include performance enhancements that are transparent to the end developer and greater portability between current and future many-core architectures. Although several vendors initially provided their own proprietary directive sets, OpenACC now provides a unified specification for accelerator directives.

OpenACC

OpenACC provides an open accelerator interface consisting primarily of compiler directives, with the aim of a portable, cross-platform solution for accelerator programming. Currently PGI, Cray, and CapsMC provide OpenACC implementations for C/C++ and Fortran.
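
The compile examples in this section build a simple vector addition code (vecAdd.c for C, vecAdd.f90 for Fortran). The following is only a minimal sketch of what a hypothetical vecAdd.c using OpenACC directives might contain, not the OLCF tutorial code itself:

#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Offload the vector addition to the accelerator */
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    free(a); free(b); free(c);
    return 0;
}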

Using OpenACC with C/C++

PGI Compiler

$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out

Cray Compiler

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out

CapsMC Compiler

$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out

Using OpenACC with Fortran

PGI Compiler

$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out

Cray Compiler

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out

CapsMC Compiler

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out

OpenACC Tutorials and Resources

The OpenACC specification provides the basis for all OpenACC implementations and is available on the OpenACC specification page. In addition, the implementation-specific documentation may be of use. PGI has a site dedicated to collecting OpenACC resources. Chapter 5 of the Cray C and C++ Reference Manual provides details on Cray's implementation. CAPS has provided an OpenACC Reference Manual.

The OLCF provides Vector Addition and Game of Life example codes demonstrating the OpenACC accelerator directives.

PGI Accelerator

Note: This section covers environments and compiler flags, not PGI’s OpenACC implementation. For implementation details, see PGI Accelerator Tutorials and Resources.

The Portland Group provides accelerator directive support with their latest C and Fortran compilers. Performance and feature additions are still taking place at a rapid pace, but the support is currently stable and full-featured enough to use in production code.

To make use of the PGI accelerator directives, the cudatoolkit module and the PGI programming environment must be loaded:

$ module load cudatoolkit
$ module load PrgEnv-pgi

To specify the platform to which the compiler directives should be applied, use the target accelerator (-ta) flag:

$ cc -ta=nvidia source.c
$ ftn -ta=nvidia source.f90

PGI Accelerator Tutorials and Resources

PGI provides a useful web portal for Accelerator resources. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. In addition, the PGI Accelerator Programming Model on NVIDIA GPUs article series by Michael Wolfe walks you through basic and advanced programming using the framework, providing very helpful tips along the way. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.

The OLCF provides Vector Addition and Game of Life example codes demonstrating the PGI accelerator directives.

CapsMC

Note: This section covers environments and compiler flags, not CAPS' OpenACC implementation. For implementation details, see CapsMC Tutorials and Resources.

The core of CAPS Enterprise's GPU directive framework is CapsMC. CapsMC is a compiler and runtime environment that interprets OpenHMPP and OpenACC directives and, in conjunction with your traditional compiler (PGI, GNU, Cray, or Intel C or Fortran compiler), creates GPU-accelerated executables.

To use the CAPS accelerator framework, you will need the cudatoolkit and capsmc modules loaded. Additionally, a PGI, GNU, or Intel programming environment must be enabled.

$ module load cudatoolkit
$ module load capsmc
$ module load PrgEnv-gnu

CapsMC modifies the Cray compiler wrappers, generating accelerator code and then linking it in without any additional flags.

$ cc source.c
$ ftn source.f90

CapsMC Tutorials and Resources

CAPS provides several documents and code snippets to get you started with HMPP Workbench. It is recommended to start with the HMPP directives reference manual and the HMPPCG reference manual.

The OLCF provides Vector Addition and Game of Life example codes demonstrating the HMPP accelerator directives.

GPU Languages/Frameworks

For complete control over the GPU, Titan supports CUDA C, CUDA Fortran, and OpenCL. These languages and language extensions, while allowing explicit control, are generally more cumbersome than directive-based approaches and must be maintained to stay up-to-date with the latest performance guidelines. Substantial code structure changes may be needed and an in-depth knowledge of the underlying hardware is often necessary for best performance.

NVIDIA CUDA C

NVIDIA’s CUDA C is largely responsible for launching GPU computing to the forefront of HPC. With a few minimal additions to the C programming language, NVIDIA has allowed low-level control of the GPU without having to deal directly with a driver-level API.

To set up the CUDA environment, the cudatoolkit module must be loaded:

$ module load cudatoolkit

This module will provide access to NVIDIA-supplied utilities such as the nvcc compiler, the CUDA visual profiler (computeprof), cuda-gdb, and cuda-memcheck. The environment variable CUDAROOT will also be set to provide easy access to NVIDIA GPU libraries such as cuBLAS and cuFFT.

To compile, we use the NVIDIA CUDA compiler, nvcc.

$ nvcc source.cu
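
As an illustration only, a hypothetical source.cu might contain a simple kernel and launch such as the following sketch:

#include <stdio.h>

/* Kernel: each thread writes its global index into the array */
__global__ void fill_index(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

int main(void)
{
    const int n = 256;
    int host[256];
    int *device;

    cudaMalloc((void **)&device, n * sizeof(int));

    /* Launch one block of n threads */
    fill_index<<<1, n>>>(device);

    /* Copy the result back to the host and check one element */
    cudaMemcpy(host, device, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("host[128] = %d\n", host[128]);

    cudaFree(device);
    return 0;
}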

For a full usage walkthrough please see the supplied tutorials.

NVIDIA CUDA C Tutorials and Resources

NVIDIA provides a comprehensive web portal for CUDA developer resources here. The developer documentation center contains the CUDA C programming guide, which very thoroughly covers the CUDA architecture. The programming guide covers everything from the underlying hardware to performance tuning and is a must-read for those interested in CUDA programming. Also available on the same downloads page are whitepapers covering topics such as Fermi tuning and CUDA C best practices. The CUDA SDK is available for download as well and provides many samples to help illustrate CUDA C programming techniques. For personalized assistance, NVIDIA has a very knowledgeable and active developer forum.

The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating CUDA C usage.

PGI CUDA Fortran

PGI’s CUDA Fortran provides a well-integrated Fortran interface for low-level GPU programming, doing for Fortran what NVIDIA did for C. PGI worked closely with NVIDIA to ensure that the Fortran interface provides nearly all of the low-level capabilities of the CUDA C framework.

CUDA Fortran will be properly configured by loading the PGI programming environment:

$ module load PrgEnv-pgi

To compile a file with the .cuf extension, we use the PGI Fortran compiler as usual:

$ ftn source.cuf

For a full usage walkthrough please see the supplied tutorials.

PGI CUDA Fortran Tutorials and Resources

PGI provides a comprehensive web portal for CUDA Fortran resources here. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. The web portal also features a set of articles covering introductory material, device kernels, and memory management. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.

The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating CUDA Fortran usage.

OpenCL

The Khronos group, a non-profit industry consortium, currently maintains the OpenCL (Open Compute Language) standard. The OpenCL standard provides a common low-level interface for heterogeneous computing. At its core, OpenCL is composed of a kernel language extension to C (similar to CUDA C) and a C API to control data management and code execution.

The cudatoolkit module must be loaded for the OpenCL header files to be found, and a PGI or GNU programming environment must be enabled:

$ module load PrgEnv-pgi
$ module load cudatoolkit

To use OpenCL, you must link against the OpenCL library:

$ gcc source.c -lOpenCL
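
As an illustration only, the following is a minimal sketch of an OpenCL host code (assuming a single GPU platform is present, with error checking omitted); it simply queries the first platform and device:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[128];

    /* Get the first available platform and its first GPU device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Print the device name */
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("OpenCL device: %s\n", name);

    return 0;
}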

OpenCL Tutorials and Resources

Khronos provides a web portal for OpenCL. From here you can view the specification, browse the reference pages, and get individual help from the OpenCL forums. The developer page is also of great use and includes tutorials and example code to get you started.

In addition to the general Khronos-provided material, users will want to check out the vendor-specific information available for capability and optimization details. Of main interest to OLCF users will be the AMD and NVIDIA OpenCL developer zones.

The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating OpenCL usage.