Compiling code on Titan (and other Cray machines) is different from compiling code for commodity or Beowulf-style HPC Linux clusters. Among the most prominent differences:
- Cray provides a sophisticated set of compiler wrappers to ensure that the compile environment is set up correctly. Their use is highly encouraged.
- In general, linking/using shared object libraries on compute partitions is not supported.
- Cray systems include many different types of nodes, so some compiles are, in fact, cross-compiles.
The following compilers are available on Titan:
- PGI, the Portland Group Compiler Suite (default)
- GCC, the GNU Compiler Collection
- CCE, the Cray Compiling Environment
- Intel, Intel Composer XE
Upon login, the default versions of the PGI compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of PGI and MPI.
Cray Compiler Wrappers
Cray provides a number of compiler wrappers that substitute for the traditional compiler invocation commands. The wrappers call the appropriate compiler, add the appropriate header files, and link against the appropriate libraries based on the currently loaded programming environment module. To build codes for the compute nodes, you should invoke the Cray wrappers via:
| cc | To use the C compiler |
| CC | To use the C++ compiler |
| ftn | To use the Fortran compiler |
The -craype-verbose option can be used to view the compile line built by the compiler wrapper:
titan-ext$ cc -craype-verbose ./a.out
pgcc -tp=bulldozer -Bstatic ...
Compiling and Node Types
Titan is composed of different types of nodes:
- Login nodes running traditional Linux
- Service nodes running traditional Linux
- Compute nodes running the Cray Node Linux (CNL) microkernel
The type of work you are performing will dictate the type of node for which you build your code.
Compiling for Compute Nodes (Cross Compilation)
Titan compute nodes are the nodes that carry out the vast majority of computation on the system. Compute nodes run the CNL microkernel, which is markedly different from the OS running on the login and service nodes. Most code that runs on Titan will be built targeting the compute nodes.
All parallel codes should run on the compute nodes. Compute nodes are accessible only by invoking
aprun within a batch job. To build codes for the compute nodes, you should use the Cray compiler wrappers:
titan-ext$ cc code.c
titan-ext$ CC code.cc
titan-ext$ ftn code.f90
The cc, CC, and ftn compiler wrappers should be used when compiling and linking source code for use on Titan compute nodes.
Support for Shared Object Libraries
On Titan, and Cray machines in general, statically linked executables perform better and are easier to launch. Certain Cray-provided modules and libraries (such as cudart) employ dynamic linking; the following warnings do not apply to them, because any necessary dynamic linking of these libraries is configured automatically by the Cray compiler wrappers.
If you must use shared object libraries, you will need to copy all necessary libraries to a Lustre scratch area ($PROJWORK) or the NFS /ccs/proj area, and then update your LD_LIBRARY_PATH environment variable to include this directory. Due to metadata overhead, /ccs/proj is the suggested area for shared objects and Python modules. For example, the following command appends your project’s NFS home area to LD_LIBRARY_PATH in bash:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ccs/proj/[projid]
Compiling with shared libraries can be further complicated by the fact that Titan’s login nodes do not run the same operating system as the compute nodes, and thus many shared libraries are available on the login nodes which are not available on the compute nodes. This means that an executable may appear to compile correctly on a login node, but will fail to start on a compute node because it will be unable to locate the shared libraries it needs.
It may appear that this could be resolved by locating the shared libraries on the login node and copying them to Lustre or /ccs/proj for use on the compute nodes. This is inadvisable because these shared libraries were not compiled for the compute nodes, and may perform erratically. Also, referring directly to these libraries circumvents the
module system, and may jeopardize your deployment environment in the event of system upgrades.
For performance considerations, it is important to bear in mind that each node in your job will need to search through
$LD_LIBRARY_PATH for each missing dynamic library, which could cause a bottleneck with the Lustre Metadata Server. Lastly, calls to functions in dynamic libraries will not benefit from compiler optimizations that are available when using static libraries.
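Before copying libraries, it helps to know exactly which shared objects an executable actually needs; ldd lists them. The following is a minimal sketch of the staging workflow described above (the executable and library names are hypothetical, and [projid] is your project ID):

```shell
# List the shared-library dependencies the runtime loader must resolve.
# Note: the paths ldd reports are from the node where it runs, not CNL.
ldd ./myapp.x

# Stage non-system libraries to the NFS project area (suggested above
# over Lustre due to metadata overhead) and extend LD_LIBRARY_PATH.
mkdir -p /ccs/proj/[projid]/lib
cp libmydep.so /ccs/proj/[projid]/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ccs/proj/[projid]/lib
```

The export must also be performed inside the batch job, since the compute nodes resolve the libraries at launch time.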
Compiling for Login or Service Nodes
When you log into Titan you are placed on a login node. When you submit a job for execution, your job script is initially launched on one of a small number of shared service nodes. All tasks not launched through
aprun will run on the service node. Users should note that there are a small number of these login and service nodes, and they are shared by all users. Because of this, long-running or memory-intensive work should not be performed on login nodes nor service nodes.
If you build with the Cray compiler wrappers (cc, CC, or ftn), your code will be built for the compute nodes by default. If you wish to build code for the Titan login nodes or service nodes, you must do one of the following:
- Add the -target-cpu=mc8 flag to your compile line
- Swap the CPU target module: module swap craype-interlagos craype-mc8
- Call the underlying compilers directly
- Use the craype-network-none module to remove the network and MPI libraries: module swap craype-network-gemini craype-network-none
XK7 Service/Compute Node Incompatibilities
On the Cray XK7 architecture, service nodes differ greatly from the compute nodes. The difference between XK7 compute and service nodes may cause cross-compiling issues that did not exist on Cray XT5 systems and prior.
For XK7, login and service nodes use AMD’s Istanbul-based processor, while compute nodes use the newer Interlagos-based processor. Interlagos-based processors include instructions not found on Istanbul-based processors, so executables compiled for the compute nodes will not run on the login nodes nor service nodes; typically crashing with an illegal instruction error. Additionally, codes compiled specifically for the login or service nodes will not run optimally on the compute nodes.
Optimization Target Warning
Because of the difference between the login/service nodes (on which code is built) and the compute nodes (on which code is run), a software package’s build process may inject optimization flags incorrectly targeting the login/service nodes. Users are strongly urged to check makefiles for CPU-specific optimization flags (ex: -tp, -hcpu, -march). Users should not need to set such flags; the Cray compiler wrappers will automatically add CPU-specific flags to the build. Choosing the incorrect processor optimization target can negatively impact code performance.
If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available on Titan:
- PrgEnv-pgi (default)
- PrgEnv-gnu
- PrgEnv-cray
- PrgEnv-intel
To change from the default PGI environment to the default GCC environment, either unload and load the modules:
$ module unload PrgEnv-pgi
$ module load PrgEnv-gnu
or perform the same change with a single swap:
$ module swap PrgEnv-pgi PrgEnv-gnu
Changing Versions of the Same Compiler
To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:
$ module swap PrgEnv-pgi PrgEnv-gnu
$ module swap gcc gcc/4.6.1
When modifying your module environment, keep the following in mind:
- Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
- Do not swap or unload any of the Cray-provided modules.
- Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.
Compiling Threaded Codes
When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.
For PGI, add “-mp” to the build line.
$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For GNU, add “-fopenmp” to the build line.
$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Cray, no additional flags are required.
$ module swap PrgEnv-pgi PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Intel, add “-qopenmp” to the build line.
$ module swap PrgEnv-pgi PrgEnv-intel
$ cc -qopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
Running with OpenMP and PrgEnv-intel
An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.
How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, we refer to the number of threads per MPI task as depth; this is controlled by the
-d option to
aprun. We refer to the number of MPI tasks or processing elements per socket as npes; this is controlled by the
-n option to
aprun. In the following examples, replace
depth with the number of threads per MPI task, and
npes with the number of MPI tasks (processing elements) per socket that you plan to use.
For the case where depth divides evenly into the number of processing elements on a socket:
$ export OMP_NUM_THREADS="<=depth"
$ aprun -n npes -d "depth" -cc numa_node a.out
For the case where depth does not divide evenly into the number of processing elements on a socket:
$ export OMP_NUM_THREADS="<=depth"
$ aprun -n npes -d "depth" -cc none a.out
For SHMEM codes, users must load the
cray-shmem module before compiling:
$ module load cray-shmem
Accelerator Compiler Directives
Accelerator compiler directives allow the compiler, guided by the programmer, to take care of low-level accelerator work. One of the main benefits of a directives-based approach is an easier and faster transition of existing code compared to low-level GPU languages. Additional benefits include performance enhancements that are transparent to the end developer and greater portability between current and future many-core architectures. Although several vendors initially provided their own sets of proprietary directives, OpenACC now provides a unified specification for accelerator directives.
OpenACC aims to provide an open, portable, cross-platform accelerator interface consisting primarily of compiler directives. Currently PGI, Cray, and CapsMC provide OpenACC implementations for C/C++ and Fortran.
Using OpenACC with C/C++
Under the default PGI environment:
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Under the Cray environment:
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
With CapsMC:
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
Using OpenACC with Fortran
Under the default PGI environment:
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Under the Cray environment:
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
With CapsMC (here under the GNU environment):
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out
OpenACC Tutorials and Resources
The OpenACC specification provides the basis for all OpenACC implementations and is available from the OpenACC website. In addition, the implementation-specific documentation may be of use. PGI has a site dedicated to collecting OpenACC resources. Chapter 5 of the Cray C and C++ Reference Manual provides details on Cray's implementation. CapsMC has provided an OpenACC Reference Manual.
The Portland Group provides accelerator directive support with their latest C and Fortran compilers. Performance and feature additions are still taking place at a rapid pace, but the compilers are currently stable and full-featured enough to use in production code.
To make use of the PGI accelerator directives, the cudatoolkit module and the PGI programming environment must be loaded:
$ module load cudatoolkit
$ module load PrgEnv-pgi
To specify the platform to which the compiler directives should be applied, the target accelerator flag is used:
$ cc -ta=nvidia source.c
$ ftn -ta=nvidia source.f90
PGI Accelerator Tutorials and Resources
PGI provides a useful web portal for Accelerator resources. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. In addition, the PGI Accelerator Programming Model on NVIDIA GPUs article series by Michael Wolfe walks you through basic and advanced programming using the framework, providing very helpful tips along the way. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.
The core of CAPS Enterprise's GPU directive framework is CapsMC. CapsMC is a compiler and runtime environment that interprets OpenHMPP and OpenACC directives and, in conjunction with your traditional compiler (PGI, GNU, Cray, or Intel C or Fortran compiler), creates GPU-accelerated executables.
To use the CAPS accelerator framework, you will need the cudatoolkit and capsmc modules loaded. Additionally, a PGI, GNU, or Intel programming environment must be enabled:
$ module load cudatoolkit
$ module load capsmc
$ module load PrgEnv-gnu
CapsMC modifies the Cray compiler wrappers, generating accelerator code and then linking it in without any additional flags.
$ cc source.c
$ ftn source.f90
CapsMC Tutorials and Resources
For complete control over the GPU, Titan supports CUDA C, CUDA Fortran, and OpenCL. These languages and language extensions, while allowing explicit control, are generally more cumbersome than directive-based approaches and must be maintained to stay up-to-date with the latest performance guidelines. Substantial code structure changes may be needed and an in-depth knowledge of the underlying hardware is often necessary for best performance.
NVIDIA CUDA C
NVIDIA’s CUDA C is largely responsible for launching GPU computing to the forefront of HPC. With a few minimal additions to the C programming language, NVIDIA has allowed low-level control of the GPU without having to deal directly with a driver-level API.
To setup the CUDA environment the
cudatoolkit module must be loaded:
$ module load cudatoolkit
This module provides access to NVIDIA-supplied utilities such as the nvcc compiler, the CUDA visual profiler (computeprof), cuda-gdb, and cuda-memcheck. The environment variable CUDAROOT will also be set to provide easy access to NVIDIA GPU libraries such as cuBLAS and cuFFT.
To compile we use the NVIDIA CUDA compiler, nvcc.
$ nvcc source.cu
For a full usage walkthrough please see the supplied tutorials.
NVIDIA CUDA C Tutorials and Resources
NVIDIA provides a comprehensive web portal for CUDA developer resources. The developer documentation center contains the CUDA C Programming Guide, which very thoroughly covers the CUDA architecture. The programming guide covers everything from the underlying hardware to performance tuning and is a must-read for those interested in CUDA programming. Also available on the same downloads page are whitepapers covering topics such as Fermi tuning and CUDA C best practices. The CUDA SDK is available for download as well and provides many samples to help illustrate CUDA C programming technique. For personalized assistance, NVIDIA has a very knowledgeable and active developer forum.
PGI CUDA Fortran
PGI’s CUDA Fortran provides a well-integrated Fortran interface for low-level GPU programming, doing for Fortran what NVIDIA did for C. PGI worked closely with NVIDIA to ensure that the Fortran interface provides nearly all of the low-level capabilities of the CUDA C framework.
CUDA Fortran will be properly configured by loading the PGI programming environment:
$ module load PrgEnv-pgi
To compile a file with the cuf extension we use the PGI Fortran compiler as usual:
$ ftn source.cuf
For a full usage walkthrough please see the supplied tutorials.
PGI CUDA Fortran Tutorials and Resources
PGI provides a comprehensive web portal for CUDA Fortran resources. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. The web portal also features a set of articles covering introductory material, device kernels, and memory management. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.
The Khronos group, a non-profit industry consortium, currently maintains the OpenCL (Open Compute Language) standard. The OpenCL standard provides a common low-level interface for heterogeneous computing. At its core, OpenCL is composed of a kernel language extension to C (similar to CUDA C) and a C API to control data management and code execution.
The cudatoolkit module must be loaded for the OpenCL header files to be found, and a PGI or GNU programming environment must be enabled:
$ module load PrgEnv-pgi
$ module load cudatoolkit
To use OpenCL you must link against the OpenCL library:
$ gcc source.c -lOpenCL
OpenCL Tutorials and Resources
Khronos provides a web portal for OpenCL. From here you can view the specification, browse the reference pages, and get individual level help from the OpenCL forums. A developers page is also of great use and includes tutorials and example code to get you started.
In addition to the general Khronos-provided material, users will want to check out the vendor-specific information for capability and optimization details. Of main interest to OLCF users will be the AMD and NVIDIA OpenCL developer zones.