Getting Performance on Titan

I have some OpenACC code that runs and gives the correct result for a production science code on the Titan supercomputer. The issue is that the performance is very bad. In fact, it’s much worse than the performance I get on my local workstation with a commodity gamer GPU. I’m hoping the forum may be able to help me understand what’s going on.

Being a production science code, it’s quite dense. I’m going to try providing some snippets and see if that’s enough to make progress. If not, I can try showing more of the code.

Here’s the primary compute region/loop:

      !$acc data copyin(sold(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:))           &
      !$acc      copyin(tempbar_init(0:hi(3)), ldt)                                 &
      !$acc      copyout(snew(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:))          &
      !$acc      copyout(rho_omegadot(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3),:) ) &
      !$acc      copyout(rho_Hnuc(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3)))        &
      !$acc      copyin(rho_Hext(lo(1):hi(1),lo(2):hi(2),lo(3):hi(3)))

      !$acc parallel loop gang vector collapse(3) private(rho,x_in,T_in,x_test,x_out) &
      !$acc    private(rhowdot,rhoH,sumX,n)
      do k = lo(3), hi(3)
         do j = lo(2), hi(2)
            do i = lo(1), hi(1)
               rho = sold(i,j,k,rho_comp)
               x_in = sold(i,j,k,spec_comp:spec_comp+nspec-1) / rho

               if (drive_initial_convection) then
                  T_in = tempbar_init(k)
               else
                  T_in = sold(i,j,k,temp_comp)
               end if

               if (ispec_threshold > 0) then
                  x_test = x_in(ispec_threshold)
               else
                  x_test = ZERO
               end if

               if (rho > burning_cutoff_density .and.                &
                    ( ispec_threshold < 0 .or.                       &
                    (ispec_threshold > 0 .and.                       &
                    x_test > burner_threshold_cutoff))) then
                  call burner(rho, T_in, x_in, ldt, x_out, rhowdot, rhoH)
               else
                  x_out = x_in
                  rhowdot = 0.d0
                  rhoH = 0.d0
               end if

               ! check if sum{X_k} = 1
               sumX = ZERO
               do n = 1, nspec
                  sumX = sumX + x_out(n)
               end do

               snew(i,j,k,rho_comp) = sold(i,j,k,rho_comp)
               snew(i,j,k,pi_comp) = sold(i,j,k,pi_comp)
               snew(i,j,k,spec_comp:spec_comp+nspec-1) = x_out(1:nspec) * rho
               rho_omegadot(i,j,k,1:nspec) = rhowdot(1:nspec)
               rho_Hnuc(i,j,k) = rhoH
               snew(i,j,k,rhoh_comp) = sold(i,j,k,rhoh_comp) &
                    + ldt*rho_Hnuc(i,j,k) + ldt*rho_Hext(i,j,k)
               snew(i,j,k,trac_comp:trac_comp+ntrac-1) = &
                    ...   ! right-hand side cut off in the original post
            end do
         end do
      end do

      !$acc end parallel
      !$acc end data

The subroutine burner is marked up as !$acc routine seq, as are all routines it calls. Here’s the loop that ultimately does most of the work in each sequential thread:

  subroutine bdf_advance(ts, neq, npt, y0, t0, y1, t1, dt0, reset, reuse, ierr, initial_call)
    !$acc routine seq
    type(bdf_ts), intent(inout) :: ts
    integer,      intent(in   ) :: neq, npt
    real(dp_t),   intent(in   ) :: y0(neq,npt), t0, t1, dt0
    real(dp_t),   intent(  out) :: y1(neq,npt)
    logical,      intent(in   ) :: reset, reuse
    integer,      intent(  out) :: ierr
    logical,      intent(in   ) :: initial_call
    integer  :: k, p, m, n
    logical  :: retry, linitial

    if (reset) call bdf_reset(ts, y0, dt0, reuse)
    ierr = BDF_ERR_SUCCESS

    ts%t1 = t1; ts%t = t0; ts%ncse = 0; ts%ncdtmin = 0;
    do k = 1, bdf_max_iters + 1
       call bdf_update(ts)                ! update various coeffs (l, tq) based on time-step history
       call bdf_predict(ts)               ! predict nordsieck array using pascal matrix
       call bdf_solve(ts)         ! solve for y_n based on predicted y and yd
       call bdf_check(ts, retry, ierr)    ! check for solver errors and test error estimate

       if (.not. retry) then
          call bdf_correct(ts)               ! new solution looks good, correct history and advance
          call bdf_adjust(ts)                ! adjust step-size/order
       end if
       if (ts%t >= t1 .or. ierr /= BDF_ERR_SUCCESS) exit
    end do

    if (ts%n > ts%max_steps .or. k > bdf_max_iters) then
       ierr = BDF_ERR_MAXSTEPS
    end if
    do p = 1, ts%npt
       do m = 1, ts%neq
          y1(m,p) = ts%z0(m,p,0)
       end do
    end do
  end subroutine bdf_advance

The compilation command and output for the primary OpenACC region looks like this:

ftn  -module t/Linux.PGI.acc/m -It/Linux.PGI.acc/m -acc -Minfo=acc  -I/ccs/home/ajacobs/Codebase/Microphysics/eos/helmholtz  -c -o t/Linux.PGI.acc/o/react_state.o /ccs/home/ajacobs/Codebase/MAESTRO/Source/react_state.f90
    764, Generating copyin(sold(lo:hi,lo:hi,lo:hi,:),tempbar_init(0:hi),ldt,rho_hext(lo:hi,lo:hi,lo:hi))
         Generating copyout(snew(lo:hi,lo:hi,lo:hi,:),rho_omegadot(lo:hi,lo:hi,lo:hi,:),rho_hnuc(lo:hi,lo:hi,lo:hi))
    771, Accelerator kernel generated
         Generating Tesla code
        775, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
        776,   ! blockidx%x threadidx%x collapsed
        777,   ! blockidx%x threadidx%x collapsed
        785, !$acc loop seq
        821, !$acc loop seq
        836, !$acc loop seq
        842, !$acc loop seq
        852, !$acc loop seq
    785, Loop is parallelizable
    821, Loop is parallelizable
    836, Loop carried reuse of snew prevents parallelization
    842, Loop is parallelizable
    852, Loop carried reuse of snew prevents parallelization

Even when this triple loop is 64**3, this code runs slower than the version compiled without OpenACC and run in serial. Does anyone have insight into why it runs so slowly? From PGI_ACC_NOTIFY=3 I know there’s no data thrashing. There’s just some upload before the loop and download after, and only of data that is needed. The compute itself just seems to be running very slowly, and it’s very hard to find out why from profiling.

Hi Adam,

One issue I’ve seen on Titan is that if you have the Craypat module (profiler) loaded, the Cray compiler drivers (cc, ftn) will implicitly add “-g” (debugging support) to the PGI compilation. “-g” will reduce optimization. One group I worked with saw a 16x slowdown fixed when they unloaded this module.

Another thought is that Titan uses K20s, which are CC20, with CUDA 7.0. If your home box uses a CC35 device, the read-only arrays may be put in texture memory. While it shouldn’t make a huge difference in performance, it is a difference.

Have you profiled the code on both systems to see where the performance difference occurs? GPU performance? CPU performance? (Note that Titan uses older AMD Interlagos chips.)

  • Mat

Hi Mat,

Thanks for your help!

Here are the modules I have loaded when compiling. Craypat’s not loaded.

Currently Loaded Modulefiles:
  1) eswrap/1.1.0-1.020200.1231.0          21) python_pygobject/2.12.3
  2) craype-network-gemini                 22) python_pycairo/1.2.0
  3) cray-mpich/7.2.5                      23) python_matplotlib/1.2.1
  4) craype-interlagos                     24) python_scipy/0.12.0
  5) lustredu/1.4                          25) python_setuptools/21.0
  6) module_msg/0.1                        26) python_ipython/3.0.0
  7) modulator/1.2.0                       27) python_zmq/14.4.1
  8) hsi/5.0.2.p1                          28) cray-libsci/13.2.0
  9) DefApps                               29) udreg/2.3.2-1.0502.10518.2.17.gem
 10) site-aprun/1.0                        30) ugni/6.0-1.0502.10863.8.28.gem
 11) aprun-usage/1.0                       31) pmi/5.0.9-1.0000.10911.175.4.gem
 12) altd/1.0                              32) dmapp/7.0.1-1.0502.11080.8.74.gem
 13) git/2.3.2                             33) gni-headers/4.0-1.0502.10859.7.8.gem
 14) gcc/4.9.0                             34) xpmem/0.1-2.0502.64982.5.3.gem
 15) pgi/16.4.lustre                       35) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
 16) craype/2.4.2                          36) alps/5.2.4-2.0502.9774.31.12.gem
 17) cray-hdf5/1.8.14                      37) rca/1.0.0-2.0502.60530.1.63.gem
 18) python/2.7.9                          38) atp/1.8.3
 19) python_numpy/1.7.1                    39) PrgEnv-pgi/5.2.82
 20) python_pygtk/2.10.6                   40) cudatoolkit/7.0.28-1.0502.10280.4.1

I would like to profile execution on the GPU, but am not sure how best to go about doing so. Do you have recommendations? Especially for seeing memory usage patterns? I must be missing something. I can’t believe that the GPU performance is so slow if I manage to run the code properly.

Should I expect major performance hits if each CUDA thread is executing several levels of seq routines? How devastating are if statements to performance?

Hi Adam,

There are several ways to get profiling information.

First you can try setting the environment variable “PGI_ACC_TIME=1”. This will cause the PGI runtime to print out basic aggregated GPU performance for each of your kernels and data regions.

Another environment variable to try is “COMPUTE_PROFILE=1”. This will dump timing information for every data copy and kernel invocation to a log file. You can also create a config file where you set which hardware profiler counters you want to use. Details can be found in NVIDIA’s documentation for the CUDA command-line profiler.

Since you have PGI 16.4 loaded, another method is to use PGI’s pgprof profiler. This will give you aggregated GPU and CPU performance. You can also have the profile logged to an output file (via the “-o” flag). This file can then be imported into either NVVP or the pgprof GUI to see the GPU timeline information.

> Should I expect major performance hits if each CUDA thread is executing several levels of seq routines?

This will have a negative impact on performance and if possible, you should try to have the compiler inline these routines.

> How devastating are if statements to performance?

It depends. If all the threads within the same warp take the same branch, then they have no impact on performance. If the threads take different paths, then you’ll have branch divergence and at least an N-times slowdown for this section of code, where N is the number of branches.
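
To make the warp-level argument concrete, here is a small host-side toy model (it does not run anything on the GPU and is not a measurement). It assumes a warp size of 32 and one block of 128 threads, as in the compiler output above, and simply counts how many distinct branches are taken inside each warp; each distinct branch costs the warp one serialized pass. All names in it are hypothetical.

```fortran
! Toy cost model only: counts the distinct branches taken within each warp.
! Program, variable and function names are hypothetical.
program divergence_model
  implicit none
  integer, parameter :: warp_size = 32
  integer, parameter :: nthreads  = 128   ! one block of 128 threads
  integer, parameter :: nbranch   = 2     ! an if / else pair
  integer :: branch(nthreads)
  integer :: t

  ! Case 1: the branch depends only on the warp index,
  ! so every thread in a given warp takes the same path.
  do t = 1, nthreads
     branch(t) = mod((t-1)/warp_size, nbranch) + 1
  end do
  print *, 'uniform   passes per warp:', passes(branch)

  ! Case 2: neighbouring threads alternate between the two paths,
  ! so every warp has to execute both of them.
  do t = 1, nthreads
     branch(t) = mod(t-1, nbranch) + 1
  end do
  print *, 'divergent passes per warp:', passes(branch)

contains

  ! Number of serialized passes for each warp = number of distinct
  ! branch ids that appear among its warp_size threads.
  function passes(b) result(p)
    integer, intent(in) :: b(:)
    integer :: p(size(b)/warp_size)
    integer :: w, k, first
    logical :: taken(nbranch)
    do w = 1, size(p)
       first = (w-1)*warp_size
       taken = .false.
       do k = 1, warp_size
          taken(b(first+k)) = .true.
       end do
       p(w) = count(taken)
    end do
  end function passes

end program divergence_model
```

In the first case each warp needs a single pass; in the second each warp needs two. In the burner loop above, whether the threshold test diverges depends on how rho and x_test vary between neighbouring zones, which this model does not attempt to capture.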

  • Mat

My use of pgprof doesn’t tell me too much more than something like PGI_ACC_NOTIFY. Basically it tells me the grid layout and that the kernel takes a real long time, but it doesn’t give me more granular information about what’s happening within the kernel (e.g., which acc routine is it spending lots of time in, or is there significant branching happening). Is there a way to get such information?

The GPU code is running one to two orders of magnitude slower than serial CPU. One thing I noticed when I did a CPU profile is that the triply nested loop I mentioned before will execute 64**3 iterations, but each of these can call subroutines 10-100 times. Could this cause such a profound slowdown?

> Is there a way to get such information?

The CUDA command line profiler can get you hardware counters. See the profiler documentation referenced above for details.

> Could this cause such a profound slowdown?

Possible. Have you tried inlining?

Though I would expect the same issue to occur on your gamer card, which is why I’m leaning more towards things like texture memory. It could be the CUDA version as well. What kind of card is your gamer GPU? What CUDA version does it use?

  • Mat

Thanks again for your help. I’m following up with the link to see if I can get some more insight into what’s happening on the GPU. In the meantime, this is an example of what I get from pgprof:

[ajacobs@titan-login2 oac_test_react]$ pgprof -i pgprof.out --print-openacc-trace --print-gpu-trace --print-api-trace
======== CPU profiling result (bottom up):
Time(%)      Time  Name
 49.97%  24.9993s  cudbgGetAPIVersion
 49.97%  24.9993s  | start_thread
 49.97%  24.9993s  |   clone
 43.54%  21.7837s  cuStreamSynchronize
 43.54%  21.7837s  | __pgi_uacc_cuda_wait
 43.54%  21.7837s  |   __pgi_uacc_computedone
 43.54%  21.7837s  |     react_state_module_burner_loop_3d_
 43.54%  21.7837s  |       react_state_module_burner_loop_
 43.54%  21.7837s  |         react_state_module_react_state_
 43.54%  21.7837s  |           varden_
 43.54%  21.7837s  |             MAIN_
 43.54%  21.7837s  |               main
  1.33%   665.3ms  getc
  1.33%   665.3ms  | read_record
  1.33%   665.3ms  |   __f90io_ldr
  1.33%   665.3ms  |     pgf90io_ldr
  1.33%   665.3ms  |       actual_eos_module_actual_eos_init_
  1.33%   665.3ms  |         eos_module_eos_init_
  1.33%   665.3ms  |           varden_
  1.33%   665.3ms  |             MAIN_
  1.33%   665.3ms  |               main

I would like to get more granular information about what’s happening within react_state_module_burner_loop_3d_. I’ll see if the CUDA tool you mention can help with this.

> Possible. Have you tried inlining?

I’ve tried playing with it a bit, but I’m struggling. As I understand it, if your routines are spread across multiple files you need to use the -Mipa=inline option. The difficulty here is that I want to inline only specific functions and I haven’t figured out how to do that yet. The answer may be that I need to make a local copy of the routines I want to inline within the relevant files, so that I can then use -Minline=func1,func2. Alternatively, I can try a very long except:func1,func2,... list for -Mipa=inline. I’ll try to explore this.

> What kind of card is your gamer GPU? What CUDA version does it use?

Here’s the output of pgaccelinfo:

CUDA Driver Version:           7050
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  358.16  Mon Nov 16 19:25:55 PST 2015

Device Number:                 0
Device Name:                   GeForce GTX 960
Device Revision Number:        5.2
Global Memory Size:            4294246400
Number of Multiprocessors:     8
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1278 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3505 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 1048576 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc50

Your gamer card is a Maxwell which is several generations newer than the K20 on Titan and is significantly faster for single precision codes. It also has much higher memory bandwidth. If your code is memory bound, this could explain the difference.

Note that the pgprof output you posted is the CPU profile, not the GPU profile. “cuStreamSynchronize” is where the CPU is waiting for the GPU. The GPU profile should be printed later in the output.

For inlining across files, -Mipa=inline is one method. The second is to perform a two-pass compile with “-Mextract=lib:” set on the first pass and “-Minline=lib:” on the second. The first pass extracts the inline information and stores it in an inline library directory. IPA essentially does the same thing but performs the second pass at link time and stores the extracted information in the “*.oo” files. Detailed information can be found in chapter 4 of the PGI User’s Guide.

Just a quick note: pgprof on Titan seems to only give me CPU profiling information. I can’t seem to get GPU info, even when I pass in GPU-specific flags.

I tried the inlining method of extracting the routines that are called several times and then inlining them on a second pass. I had to use reshape, and according to -Minfo=inline the inlining was successful. I tried various configurations, but in all cases the code was a bit slower with inlining.

I did some profiling using the command line profiler, but could not get useful insight from it. Here’s the output comparing a prototype code that runs faster on the GPU and the production code I’m working on now:

method=[ react_84_gpu ] gputime=[ 3628653.750 ] cputime=[ 3628676.500 ] occupancy=[ 0.188 ] active_cycles=[ 2650675947 ] sm_cta_launched=[ 61  ] warps_launched=[ 244 ] threads_launched=[ 7808  ] 
method=[ burner_loop_ ] gputime=[ 24348482.00 ] cputime=[ 24349284.00 ] occupancy=[ 0.125 ] active_cycles=[ -1         ] sm_cta_launched=[ 151 ] warps_launched=[ 604 ] threads_launched=[ 19328 ]

For the production code, active_cycles is -1, which I’m not sure how to interpret. If you have any suggested counters that could help me understand what’s going on, or if the above data tells you anything, please let me know.
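
As a rough way to read the two profiler lines above, the counters can be cross-checked with a little arithmetic. The sketch below is only that: it assumes a warp size of 32 and a limit of 64 resident warps per multiprocessor (the Kepler-class figure; this is an assumption about the device both kernels ran on), and the program and routine names are hypothetical. The numbers passed in are copied from the profiler output.

```fortran
! Sanity-check arithmetic on the profiler counters quoted above.
! max_warps_per_sm = 64 is an ASSUMPTION (Kepler-class limit).
program occupancy_check
  implicit none
  integer, parameter :: warp_size        = 32
  integer, parameter :: max_warps_per_sm = 64

  !           kernel          CTAs  warps  threads  occupancy
  call report('react_84_gpu',   61,   244,    7808, 0.188)
  call report('burner_loop_',  151,   604,   19328, 0.125)

contains

  subroutine report(name, ctas, warps, threads, occupancy)
    character(len=*), intent(in) :: name
    integer,          intent(in) :: ctas, warps, threads
    real,             intent(in) :: occupancy
    integer :: threads_per_cta, warps_per_cta, resident_warps

    threads_per_cta = threads / ctas               ! block size seen by the profiler
    warps_per_cta   = threads_per_cta / warp_size
    resident_warps  = nint(occupancy * max_warps_per_sm)

    print '(a,a)',  'kernel: ', name
    print '(a,i0)', '  threads per block:       ', threads_per_cta
    print '(a,l1)', '  warps*32 == threads:     ', warps*warp_size == threads
    print '(a,i0)', '  resident warps per SM:   ', resident_warps
    print '(a,i0)', '  resident blocks per SM:  ', resident_warps / warps_per_cta
  end subroutine report

end program occupancy_check
```

Under those assumptions both kernels use 128-thread blocks, and the production kernel keeps roughly 8 warps (two blocks) resident per multiprocessor against roughly 12 (three blocks) for the prototype. Low occupancy like this is often a sign of heavy per-thread resource use, but these counters alone do not say which resource is the limit.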

Aah. This is because you only have trace information enabled. Add “--print-gpu-summary” for the GPU kernel times or “--print-summary” (or “-s”) for the GPU, OpenACC API calls, and CPU time summaries.

Alternatively, you can write your profile to a file with “-o” and then import it with “-i” to view it. The import view will give you summaries.

Note that you can also collect hardware event counters with pgprof, but to view them you need to export and then import the profile. Event counters can take a while, since the driver may need to replay the kernel in order to collect more of them.

For example:

pgprof -s --events all -o my.out <executable>
pgprof -i my.out >& myprof.txt
vi myprof.txt

I hadn’t used the event counters in pgprof before, so this was new for me as well. It works very well and is easier to use than the CUDA command line profiler.

  • Mat

For future reference for anyone who stumbles upon this post: the primary issue appears to have been that several of my acc routines had automatic arrays. Some of these routines are called hundreds of thousands of times on the GPU. This meant that each GPU thread was constantly having to allocate and deallocate memory. Initial testing with automatic arrays removed from crucial routines shows that my compute loop on the GPU is now about 5.4x faster than perfect OpenMP scaling (i.e. 87x faster than serial, since Titan has 16 cores on a node).
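
A minimal sketch of the kind of change described above, not the actual production routines: it contrasts a device routine that declares an automatic array (sized by a dummy argument) with one that uses a fixed-size local array whose bound is a compile-time constant. The module, routine and parameter names (work_mod, scale_automatic, scale_fixed, NSPEC_MAX) are hypothetical, and the fixed bound is an assumption that only works when the maximum size is known at compile time.

```fortran
! Sketch only: automatic array versus fixed-size local array in a device
! routine. All names here are hypothetical.
module work_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: NSPEC_MAX = 16    ! assumed compile-time upper bound
contains

  ! tmp is an automatic array: its size depends on the dummy argument n,
  ! so storage may have to be obtained and released on every call.
  subroutine scale_automatic(n, x, y)
    !$acc routine seq
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x(n)
    real(dp), intent(out) :: y(n)
    real(dp) :: tmp(n)
    integer  :: i
    do i = 1, n
       tmp(i) = 2.0_dp * x(i)
    end do
    do i = 1, n
       y(i) = tmp(i) + 1.0_dp
    end do
  end subroutine scale_automatic

  ! tmp has a constant bound, so no per-call allocation is needed.
  ! The caller must guarantee n <= NSPEC_MAX.
  subroutine scale_fixed(n, x, y)
    !$acc routine seq
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x(n)
    real(dp), intent(out) :: y(n)
    real(dp) :: tmp(NSPEC_MAX)
    integer  :: i
    do i = 1, n
       tmp(i) = 2.0_dp * x(i)
    end do
    do i = 1, n
       y(i) = tmp(i) + 1.0_dp
    end do
  end subroutine scale_fixed

end module work_mod

program demo
  use work_mod
  implicit none
  integer, parameter :: n = 4, ncell = 8
  real(dp) :: x(n,ncell), ya(n,ncell), yf(n,ncell)
  integer  :: c, i

  do c = 1, ncell
     do i = 1, n
        x(i,c) = real(i + c, dp)
     end do
  end do

  ! Both variants are called once per cell, as burner is called per zone.
  !$acc parallel loop copyin(x) copyout(ya, yf)
  do c = 1, ncell
     call scale_automatic(n, x(:,c), ya(:,c))
     call scale_fixed(n, x(:,c), yf(:,c))
  end do

  print *, 'max difference between variants:', maxval(abs(ya - yf))
end program demo
```

The two routines compute the same result; the only difference is how the scratch array is declared. Without the OpenACC flag the directives are ignored as comments and the program runs on the host, so it can be used to check correctness before timing the two variants on the device.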

Thanks so much to Mat (mkcolg) for reviewing my code and helping me to find this issue.

I think I’m having the exact same case. Thank you for updating the post AMJacobs. I’m only wondering the following:

This “cudbgGetAPIVersion” that you see that uses up so much of your runtime - was this related to the cudaMalloc / cudaFrees in automatic arrays? I’m asking because my malloc/frees don’t occupy that much time in the cpu profile, but I’m also seeing a massive overhead with cudbgGetAPIVersion -> start_thread -> clone. Am I on the right path suspecting the memory allocations / frees?