Workshop for porting applications to Titan using Cray OpenACC and Cray Performance Tools
This three-day, hands-on workshop is designed to help Titan users port their applications to Titan and optimize them there. The workshop will cover the Cray CCE OpenACC compiler and the Cray performance tools. Each morning will consist of lectures and demonstrations on using the tools and compiler, and each afternoon will be dedicated to attendees accessing the Titan system to exercise them. The workshop will concentrate on a fifteen-step process. Whether or not OpenACC is ultimately used to target the accelerator, this work must be performed to refactor the application so that it achieves good performance on any hybrid multicore system.
1) First and foremost – Profile the application
a) Must identify looping structure within the time step loop
b) Use -h profile_generate when compiling and -Ocalltree or -Ocallers with CrayPat
2) Use Reveal to identify scoping of variables in the major loop, which may call subroutines and functions
a) The idea is to first generate an OpenMP version of the loop and then add some OpenACC
3) Use OpenACC to identify the data motion required to run with a companion accelerator
a) Once scoping is obtained, the OpenACC compiler will indicate what data would need to be moved to run on the accelerator – user must have the variable scoping correct
4) Once one loop is analyzed, look at the next most compute-intensive loop and repeat steps 2 and 3.
5) Soon multiple loops can be combined within an OpenACC data region to eliminate transfers to and from the host.
6) Work outward until a data region encompasses a communication, I/O or looping structure more suited for the host
a) Must use updates to move data to and from the host to supply host with up-to-date data
7) Move data region outside time step loop
a) All updates must now be accounted for to keep host and accelerator data consistent
8) Test versions after each step – don’t worry about performance yet – just accuracy
9) The compiler may introduce data transfers, so look at the -rm listing for each individual OpenACC loop.
10) Optimize/minimize data transfers first by using the present data clause.
11) Gather perftools statistics on code and identify bottlenecks
12) If the bottleneck is data copies, revisit step 9
13) If the bottleneck is kernel performance
a) Look at the -rm listing and see what the compiler did to optimize the loop
b) Ideally we want three levels of parallelism: gang, worker, vector
c) The inner loop needs to be marked g in the listing
d) If the inner loop is marked only with a loop level, it is running in scalar mode, which is BAD
e) Consider incorporating accelerator library calls
f) Consider incorporating CUDA routines
14) Consider introducing CUDA streams
a) Either by taking an outer loop that cannot be parallelized due to communication and running it in streaming mode
b) Or by taking several independent operations and running them in streaming mode
15) Start looking at timelines showing communication, host execution and accelerator execution
a) Identify what can be overlapped
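As a rough illustration of steps 5 through 7 and step 10, here is a minimal C sketch, not taken from the workshop materials. It assumes a hypothetical 1-D smoothing code: the function name run_time_steps, the arrays u and tmp, and the boundary-condition "host work" are all invented for this example. One data region encloses the whole time step loop, the two kernels use the present clause, and update directives move only what the host needs. A compiler without OpenACC support ignores the pragmas and runs the same code serially on the host.

```c
#include <assert.h>

/* Hypothetical example: smooth a 1-D field u for `steps` time steps.
 * tmp is scratch space of the same length. Returns the sum of u
 * after the last step. */
double run_time_steps(double *u, double *tmp, int n, int steps)
{
    assert(n >= 3 && steps >= 0);

    /* Step 7: the data region sits outside the time step loop, so u is
     * copied to the accelerator once and copied back once at the end. */
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        for (int t = 0; t < steps; ++t) {
            /* Steps 5 and 10: both kernels reuse data that is already
             * present on the accelerator. */
            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i)
                tmp[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;

            #pragma acc parallel loop present(u[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i)
                u[i] = tmp[i];

            /* Step 6: host-side work (a stand-in for communication or
             * I/O) needs current data, so bring it back with an update. */
            #pragma acc update host(u[0:n])
            u[0] = u[1];          /* zero-gradient boundary on the host */
            u[n - 1] = u[n - 2];

            /* Send only the elements the host changed back to the device. */
            #pragma acc update device(u[0:1], u[n-1:1])
        }
    }

    /* After the data region closes, the host copy of u is up to date. */
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += u[i];
    return sum;
}
```

Calling run_time_steps(u, tmp, n, steps) from an existing driver is enough to try it. Checking the returned sum against a host-only build is one simple way to follow step 8 before looking at performance.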
On-site registration is closed to off-site visitors.
Registration is now closed.
Please email me (firstname.lastname@example.org) if you plan on making reservations at the ORNL Guest House. I need to fill out different forms to give you access to the lab. Thanks, Sherry Ray
Workshop materials can be found here: