
Arm DDT is an advanced debugging tool for scalar, multi-threaded, and large-scale parallel applications. In addition to traditional debugging features (setting breakpoints, stepping through code, examining variables), DDT supports attaching to already-running processes and memory debugging. In-depth debugging information is beyond the scope of this guide; for that, see the Official DDT User Guide.


Launching DDT

The first step in using DDT is to launch the DDT GUI. This can either be done on the remote machine using X11 forwarding, or by running a remote client on your local machine and connecting it to the remote machine. The remote client provides a native GUI (for Linux / OS X / Windows) that should be far more responsive than X11, but requires a little extra setup. It is also useful if you don't have a preconfigured X11 server. To get started using the remote client, follow the DDT Remote Client setup guide. To use X11 forwarding, run the following in a terminal:
$ ssh -X user@<host>.ccs.ornl.gov
$ module load forge
$ ddt &

Running your job

Once you have launched a DDT GUI, you can initiate a debugging session from a batch job script using DDT's "Reverse Connect" functionality. This connects the debug session launched from the batch script to an already-running GUI. This is the most widely applicable method of launching, and allows re-use of any setup logic contained in existing batch scripts. (This method can also be easily modified to launch DDT from an interactive batch session.)
  • Copy or modify an existing job script. (If you don't have an existing job script, you may wish to read the section on letting DDT submit your job to the queue).
  • Include the following near the top of your job script:
    source $MODULESHOME/init/bash # If not already included; makes the module command available
    module load ddt
  • Finally, prefix the aprun/mpirun command with ddt --connect, so that:
    aprun -n 1024 -N 8 ./myprogram
    becomes:
    ddt --connect aprun -n 1024 -N 8 ./myprogram
After submitting this script to the batch system (and waiting for it to be scheduled), a prompt will appear in the DDT GUI asking if you would like to debug this job. Once accepted, you can configure some final options before launching your program.
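The steps above can be combined into a minimal reverse-connect job script. The sketch below assumes a PBS system like Titan's; the project ID, resource requests, and program name are placeholders for your own values:

```shell
#!/bin/bash
#PBS -A PROJ123                      # hypothetical project ID
#PBS -l walltime=01:00:00,nodes=128

# Make the module command available, then load DDT
source $MODULESHOME/init/bash
module load ddt

cd $PBS_O_WORKDIR

# Prefix the normal launch line with "ddt --connect"; the running
# DDT GUI will then prompt you to accept the debug session.
ddt --connect aprun -n 1024 -N 8 ./myprogram
```

Submit the script with qsub as usual; nothing else in it needs to change.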

Attaching to a running job

You can also use DDT to connect to an already-running job. To do so, you must be connected to the system on which the job is running. You do not need to be logged into the job's head node (the node from which aprun/mpirun was launched), but DDT needs to know the head node. The process is fairly simple:
  1. Find your job's head node:
    • On Titan and Eos, run qstat -f <jobid> | grep login_node_id. The node listed is the head node.
    • On other systems, run qstat -f <jobid> | grep exec_host. The _first_ node listed is the head node.
  2. Start DDT by running module load forge and then ddt.
  3. When DDT starts, select the option to "Attach to an already running program".
  4. In that dialog box, make sure the appropriate MPI implementation is selected. If not, click the "Change MPI" button and select the proper one.
  5. If the job's head node is not listed after the word "Hosts", click on "Choose Hosts".
    • Click "Add".
    • Type the host name in the resulting dialog box and click "OK".
    • To make things faster, uncheck any other hosts listed in the dialog box.
    • Click "OK" to return.
  6. Once DDT has finished scanning, your job should appear in the "Automatically-detected jobs" tab. Select it and click the "Attach" button.
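If you script parts of this workflow, step 1 can be automated with a little text processing. Below is a sketch, assuming qstat -f output of the form described above (the job and node names here are hypothetical):

```shell
# A sample "qstat -f <jobid>" line of the kind described in step 1
# (hypothetical node names):
qstat_line='    exec_host = titan-ext1/0+nid00012/0+nid00013/0'

# Strip everything up to "exec_host = ", then everything from the
# first "/" onward; what remains is the first (head) node.
head_node=$(printf '%s\n' "$qstat_line" \
  | sed -e 's/.*exec_host = //' -e 's|/.*||')

echo "$head_node"   # -> titan-ext1
```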

Letting DDT submit your job to the queue

This method can be useful when using the DDT Remote Client or when your program doesn't have a complex existing launch script.
  1. Run module load forge.
  2. Run ddt.
  3. When the GUI starts, click the "Run and debug a program" button.
  4. In the DDT "Run" dialog, ensure the "Submit to Queue" box is checked.
  5. Optionally select a queue template file (by clicking "Configure" by the "Submit to Queue" box). If your typical job scripts are basically only an aprun command, the default is fine. If your scripts are more complex, you'll need to create your own template file. The default file can be a good start. If you need help creating one, contact help@olcf.ornl.gov or see the DDT User Guide.
  6. Click the "Parameters" button by the "Submit to Queue" box.
  7. In the resulting dialog box, select an appropriate walltime limit, account, and queue. Then click "OK".
  8. Enter your executable in the "Application" box, enter any command line options your executable takes on the "Arguments" line, and select an appropriate number of processes and threads.
  9. Click "Submit". Your job will be submitted to the queue and your debug session will start once the job begins to run. While it's waiting to start, DDT will display a dialog box showing showq output.

Starting DDT from an interactive-batch job

Note: To tunnel a GUI from a batch job, the -X PBS option should be used to enable X11 forwarding.

Starting DDT from within an interactive job gives you the advantage of running repeated debug sessions with different configurations while only waiting in the queue once.
  1. Start your interactive-batch job with qsub -I -X ... (-X enables X11 forwarding).
  2. Run module load forge.
  3. Start DDT with the command ddt.
  4. When the GUI starts, click the "Run and debug a program" button.
  5. In the DDT "Run" dialog, ensure the "Submit to Queue" box is not checked.
  6. Enter your executable, number of processors, etc.
  7. Click "Run" to run the program.

Memory Debugging (on Cray systems)

In order to use the memory debugging functionality of DDT on Titan, you need to link against the DDT memory debugging library. (On non-Cray systems DDT can preload the shared library automatically if your program uses dynamic linking). In order to link the memory debugging library:
  1. module load ddt (This determines the location of the library to link).
  2. module load ddt-memdebug (This tells the ftn/cc/CC compiler wrappers to link the library).
  3. Re-link your program (e.g. by deleting your binary and running make).
Once re-linked, run your program with DDT, ensuring you enable the "Memory Debugging" option in the run dialog.

Memory Debugging Caveats
  • The behavior of ddt-memdebug depends on the current programming environment. For this reason, you may encounter issues if you switch programming environments after ddt-memdebug has been loaded. To avoid this, please ensure that you unload ddt-memdebug before switching programming environments (you can then load it again).
  • The Fortran ALLOCATE function cannot currently be wrapped when using the PGI compiler, so allocations will not be tracked or protected.
  • When using the Cray compiler, some libraries are compiled in such a way that DDT cannot collect a backtrace when allocating memory. In this case, DDT can only show the location (rather than the full backtrace) of where memory is allocated.
For additional information on memory debugging, see the DDT User Guide and/or the Fixing Memory Leaks with DDT Leak Reports section below.

Debugging scalar/non-MPI programs (Cray systems)

Launching a debug session on the Cray systems requires the program to be linked with the Cray PMI library. (This happens automatically when linking with MPI.) In addition, DDT must be told not to run your program to the MPI_Init function (as it won't be called). If you are using the Cray compiler wrappers, you can load the ddt-non-mpi module (before linking your program) to include the PMI library. The same module should also be loaded prior to running ddt (to tell DDT not to attempt to run to MPI_Init during initialization). Finally, enable the "MPI" option in the DDT run dialog. This will ensure DDT launches your program with aprun.

Using the ddt-non-mpi module with the DDT Remote Client

When using the DDT Remote Client, we can't load the ddt-non-mpi module into the client itself. Instead, we have three options:
  1. If using "Reverse Connect", load the module before launching ddt --connect ...
  2. Load the ddt-non-mpi module inside your "queue template file" (configured via the "Options" dialog).
  3. Load the module using the "remote script" mechanism while configuring your remote host in the DDT Remote Client. For convenience, you may specify the following file as the remote script: /sw/sources/ddt/ddt-non-mpi.sh.

Fixing Memory Leaks with DDT Leak Reports

Memory leaks occur when memory is allocated but not correctly freed. This can be particularly problematic if the allocations are large or frequent. Over time, these leaks can degrade performance or, worse, cause the program to fail. DDT's memory debugging features allow analysis of allocated heap memory, both interactively, using the GUI, and non-interactively, using DDT's "offline debugging" mode. The information below will show you how to generate a leak report to pinpoint leaks and eliminate them. Unlike conventional, interactive debugging, these reports can be created during a batch job, meaning you do not need to be present at the time your job is scheduled.

Note: These instructions require DDT 5.0 or above.

Source Code

The source code for this example can be downloaded here. The source code is contained within a git repository with tagged versions "initial", "fix-1" and "fix-2". In addition, the "leak-reports" folder contains example leak reports for the different versions, and two queue submission files are included to launch the example program with and without DDT.

Linking with DDT's Memory Debugging Library

The first step towards creating a memory leak report is to link your program with DDT's memory debugging library. This will intercept calls to memory allocation and release functions (such as malloc and free) and record their location in your program.

Note: Manual linking is required only for Cray systems, or when using static linking.

Linking with DDT's memory debugging library can be automated by loading the ddt-memdebug module. After loading your usual compilation environment, load the following modules:
$ module load forge
$ module load ddt-memdebug
Then re-link your program. How this is done will vary depending on your build system, but it's often sufficient to delete the application binary, and have make regenerate it. For our example:
$ rm mandel
$ make

Launching with DDT

The next step is to launch the program with DDT. As of DDT 5.0, we can prefix our aprun command with the appropriate DDT command. In our example, we can edit submit.qsub to first load the DDT module:
$ source $MODULESHOME/init/bash # Only required if used in a batch script
$ module load forge
and then modify our aprun command so that:
$ aprun -n 16 ./mandel
becomes:
$ ddt --offline=leak-report.html --mem-debug=fast --check-bounds=off aprun -n 16 ./mandel
The DDT arguments used are as follows:

--offline=leak-report.html: This tells DDT to run in non-interactive "offline" mode, and save the output to leak-report.html

--mem-debug=fast: This enables the memory debugging options in DDT and uses the "fast" preset. (The "fast" preset runs the fewest memory checks, in order to reduce overhead.)

--check-bounds=off: This disables bounds checking (or "guard pages") in DDT. While this can be useful when tracking down invalid memory accesses, disabling this will reduce the runtime and memory overhead.

(The download bundle also contains a pre-modified version of submit.qsub named submit-ddt.qsub.) Now we submit our batch job:
$ qsub -A <projectID> submit-ddt.qsub
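For reference, the DDT-enabled submission script might look like the sketch below (the project ID and resource requests are placeholders; only the commented lines differ from an ordinary submit.qsub):

```shell
#!/bin/bash
#PBS -A PROJ123                     # hypothetical project ID
#PBS -l walltime=00:30:00,nodes=2

# Added for DDT: make "module" available and load the forge module
source $MODULESHOME/init/bash
module load forge

cd $PBS_O_WORKDIR

# Added for DDT: the --offline/--mem-debug prefix on the aprun line
ddt --offline=leak-report.html --mem-debug=fast --check-bounds=off \
    aprun -n 16 ./mandel
```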

Interpreting the Output

Once the job has finished, copy the output file (leak-report.html) to your local machine and open it with a web browser. (Alternatively, open leak-reports/initial.html from the source code download.) Scrolling down to the leak report section, we should see something like this:
Initial Leak Report Chart


For scalability reasons, DDT will limit the report to the 8 ranks with the greatest memory leakage (this can be controlled with the --leak-report-top-ranks command line argument). In the example shown we can see that rank 0 has leaked more memory than the others, and that most of the allocations were created by the Packet::allocate function. Clicking the bar chart item for rank 0 will display a table below showing details of the allocations:
Initial Leak Report Allocations Table


This table shows allocations grouped by the backtrace taken when the allocation was made, along with source code snippets. This information can be used to identify code paths leading to the largest leaks. In our example, the first row of the table represents a single 16MB allocation, whereas the second row represents 92 smaller allocations, totaling 14.72MB. All of these allocations share the same allocation site (as noted by "#0 Packet::allocate() (packet.cpp:91)" at the top of each stack), but have taken different paths through the code to get there (as shown by the differing entries further down the stack).

Once we have the allocation site (the Packet::allocate member function), the next step is figuring out why this allocation isn't freed. From the source code snippet, we can see that the allocation is assigned to the iterations variable. Reading through the Packet class, we can determine that the iterations allocation is owned by the Packet class, and yet the Packet::~Packet destructor doesn't contain code to free it. The simple fix here is to add free(pointer); to the destructor. (See "git show fix-1" for more details.) After making the fix, running "make" will recompile the code.

Another Leak!

After fixing our leak and recompiling, the next step is to verify the leak is gone, by resubmitting our job and generating a new leak report. Opening the newly-generated leak-report.html (or leak-reports/fix-1.html from the source code download) shows the following:
Leak Report Bar Chart (After Fix 1)


While we've fixed one of our leak sites, rank 0 is still leaking around 16MB of memory. Clicking the bar chart item will again show us the allocation details.
Leak Report Allocations Table (After Fix 1)


Here we see that the remaining leak was created from a single allocation (again, made in Packet::allocate). As we've already fixed the leak in the Packet class, we should check that the Packet object itself is being correctly freed. We can do this by methodically working our way up the stack. Following the backtrace, we see that Packet::allocate is called by Packet::stitch, which is in turn called by PacketFactory::stitch. The source code snippet for PacketFactory::stitch shows us that Packet::stitch is being called on the packet member object, so let's verify that this is being freed. With a little reading of the source code (e.g. packetfactory.h), we can see that packet is a plain object member of the PacketFactory class, so when a PacketFactory object is freed, packet should be freed too.

Let's jump up another level to find the origin of the PacketFactory object itself. strategy1 (found in mandel.cpp) shows the PacketFactory is actually an instance of the derived class SimplePacketFactory, named factory. Hopping up one final time to main, we can see that the factory object is passed to strategy1 as a dereferenced pointer (*factory), and dynamically allocated a little further up the function. The rest of the main function is relatively simple, and we can see that factory isn't directly freed (or passed to any additional function calls where it could be freed).

Now that we've found the source of the leak, there are a few options to fix it: we could rewrite the code to avoid the dynamic allocation entirely, or wrap the pointer in a C++ auto_ptr (or C++11 unique_ptr), but the simplest solution here is to add "delete factory" once we have finished using the factory (i.e. after the switch statement). See "git show fix-2" for more details. After making our final change, let's recompile and generate one last report to verify our leak has been fixed. Opening leak-report.html (or leak-reports/fix-2.html from the source code download), we see:
Leak Report Bar Chart (After Fix 2)


The chart may initially look busier than our other reports, but the total leaked memory is now only 16.75 kB, and the functions responsible are from various system libraries outside of our control. We've now successfully rid our program of the two memory leaks, and improved the correctness of our code. We can also be more confident that our program (at least for the current configuration) is leak-free.


Arm MAP

Arm MAP (part of the Arm Forge suite, with DDT) is a profiler for parallel, multi-threaded, or single-threaded C, C++, and Fortran codes. It provides in-depth analysis and bottleneck pinpointing down to the source line. Unlike most profilers, it's designed to profile pthreads, OpenMP, or MPI code, whether parallel or threaded. MAP aims to be simple to use: there's no need to instrument each source file or configure anything.

Linking your program with the MAP Sampler (for Cray systems)

In order to collect information about your program, you must link it with the MAP sampling libraries. When using shared libraries on non-Cray systems, MAP can do this automatically at runtime. On Cray systems, this process must be performed manually; the map-link-static and map-link-dynamic modules can help with this.
  1. module load forge
  2. module load map-link-static # or map-link-dynamic
  3. Re-compile or re-link your program.

Do I need to recompile?

There's no need to instrument your code with Arm MAP, so there's no strict requirement to recompile. However, if your binary wasn't compiled with the -g compiler flag, MAP won't be able to show you source-line information, so recompiling would be beneficial.

Note: If using the Cray compiler, you may wish to use -G2 instead of -g. This will prevent the compiler from disabling most optimizations, which could affect runtime performance.
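As an illustration, a recompile that keeps source-line information might look like the following (file and binary names are hypothetical; exact flags depend on your compiler):

```shell
# Cray compiler wrappers: -G2 keeps line-level debug info without
# disabling most optimizations
cc  -G2 -O2 -c compute.c
ftn -G2 -O2 -c solver.f90
cc  -G2 -O2 compute.o solver.o -o myapp

# Other compilers (e.g. GCC, PGI, Intel): use -g instead
cc -g -O2 -c compute.c
```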

Generating a MAP output file

Arm MAP can generate a profile using the GUI or the command line. The GUI option should look familiar to existing users of DDT, whereas the command line option may offer the smoothest transition when moving from an existing launch configuration. MAP profiles are small in size, and there's generally no configuration required other than your existing aprun command line. To generate a profile using MAP, take an existing queue submission script and modify it to include the following:
source $MODULESHOME/init/bash # May already be included if using modules
module load forge
Then prefix your aprun command so that:
aprun -n 128 -N 8 ./myapp a b c
would become:
map --profile aprun -n 128 -N 8 ./myapp a b c
Once your job has completed running, the program's working directory should contain a timestamped .map file such as myapp_1p_1t_2016-01-01_12-00.map.

Profiling a subset of your application

To profile only a subset of your application, you can either use the --start-after=TIME option and its related command line options (see map --help for more information), or use the API to have your code tell MAP when to start and stop sampling, as detailed in the Arm Forge user guide.
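For example, to skip a lengthy setup phase, the MAP prefix from the previous section can be extended with --start-after (the delay value here is illustrative; check map --help on your installed version for the full list of related options):

```shell
# Begin sampling 120 seconds into the run, skipping initialization
map --profile --start-after=120 aprun -n 128 -N 8 ./myapp a b c
```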

Viewing a MAP profile

Once you have collected a profile, you can then view the information using the map command, either by launching and choosing "Load Profile Data File", or by specifying the file on the command line e.g.
map ./myapp_1p_1t_2016-01-01_12-00.map
(The above will require an SSH connection with X11 forwarding, or another remote graphics setup.) An alternative that provides a local, native GUI (for Linux, OS X, or Windows) is to install the Arm Forge Remote Client on your local machine. This client is able to load and view profiles locally (useful when working offline) or remotely (which avoids the need to copy the profile data and corresponding source code to your local machine). The remote client can be used for both Arm DDT and Arm MAP. For more information on how to install and configure the remote client, see the remote client setup page. The Arm Forge user guide (also available via the "Help" menu in the MAP GUI) covers MAP in more depth.
