Very slow performance of some loops


I am facing performance problems of my code. I have several subroutine calls in a main program. all the subroutines are ported to the GPU. The time spent in the loops of the subroutines should be all in the order of 10^-3 to 10^-4 seconds. When I compile the program and run it several times, there are always three loops that are VERY slow (order of 1-2 seconds). Strange thing is that this times are jumping around, meaning that for one execution loop3 in subroutine1 has 2sec while as when I compile and execute the program again then loop6 in subroutine1 has about 2s…
Please check the following output:

 Time for Sub1,Loop1 [s]        =   8.2000000000000001E-005
 Time for Sub1,Loop2 [s]        =   2.0699999999999999E-004
 Time for Sub1,Loop3 [s]        =    2.107871000000000        !!!
 Time for Sub1,Loop4 [s]        =   1.5999999999999999E-004
 Time for Sub1,Loop5 [s]        =   8.0599999999999997E-004
 Time for Sub1,Loop6 [s]        =   8.3400000000000002E-003
 Time for Sub1,Loop7 [s]        =   2.4399999999999999E-004
 Time for Sub1,Loop8 [s]        =   1.0300000000000000E-004
 Time for Sub1,Loop9 [s]        =   8.2999999999999998E-005
 Time for Sub1,Loop10[s]        =   2.2551999999999999E-002
 Time for Sub2,Loop1 [s]        =   1.7500000000000000E-004
 Time for Sub2,Loop2 [s]        =   8.3999999999999995E-005
 Time for Sub2,Loop3 [s]        =   7.0999999999999991E-005
 Time for Sub2,Loop4 [s]        =   2.7799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3799999999999999E-004
 Time for Sub2,Loop6 [s]        =   6.3000000000000000E-005
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 Time for Sub4,Loop1 [s]        =   1.5999999999999999E-004
 Time for Sub4,Loop2 [s]        =   1.7799999999999999E-004
 Time for Sub4,Loop3 [s]        =    2.636425000000000        !!!
 Time for Sub1,Loop1 [s]        =   6.7000000000000002E-005
 Time for Sub1,Loop2 [s]        =   1.5233000000000000E-002
 Time for Sub1,Loop3 [s]        =   1.1600000000000000E-004
 Time for Sub1,Loop4 [s]        =   1.3799999999999999E-004
 Time for Sub1,Loop5 [s]        =   2.4163000000000000E-002
 Time for Sub1,Loop6 [s]        =   3.7700000000000000E-004
 Time for Sub1,Loop7 [s]        =   1.9899999999999999E-004
 Time for Sub1,Loop8 [s]        =   6.9999999999999994E-005
 Time for Sub1,Loop9 [s]        =   6.4999999999999994E-005
 Time for Sub1,Loop10[s]        =    1.978296000000000        !!!
 Time for Sub2,Loop1 [s]        =   1.6899999999999999E-004
 Time for Sub2,Loop2 [s]        =   7.4999999999999993E-005
 Time for Sub2,Loop3 [s]        =   7.3999999999999996E-005
 Time for Sub2,Loop4 [s]        =   2.5799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3500000000000000E-004
 Time for Sub2,Loop6 [s]        =   6.7999999999999999E-005
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 Time for Sub4,Loop1 [s]        =   8.0800000000000002E-004
 Time for Sub4,Loop2 [s]        =    1.442023000000000        !!!
 Time for Sub4,Loop3 [s]        =   8.7599999999999993E-004

Is it possible that from time to time the launch of a new kernel might lead to a 1 to 2 second breakdown of the program? That the GPU stands still? To me it seems like arbitrary behaviour where the slow loop is. I studyied also the compiler message carefouly and there is no device-host data transfer during execution!
Thank you very much!!!

Hi elephant,

How are you timing the loops? If you are using a CPU timer, then it’s possible that the OS is interrupting your process, and this delay is being picked up in your timer.

If using the PGI Accelerator Model, add the “-ta=nvidia,time” option to get some basic profiling information about your kernels. For CUDA Fortran, try using the CUDA events to get more accurate timing information.


    real :: func_time
    type(cudaEvent) :: func_start, func_end
    istat = cudaEventCreate(func_start)
    istat = cudaEventCreate(func_end)    i
    istat = cudaEventRecord(func_start, 0)
    call  mykernel<<<dimGrid>>>(dX, dY, dTemp, N)
    istat = cudaEventRecord(func_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(func_time, func_start, func_end)
    mytime = func_time/1000

Also, try using a profiler to get a better understanding on what’s happening. You can use "pgcollect -cuda " to gather performance data and then view this data in the PGI profiler, pgprof. Or set the environment variable “CUDA_PROFILE=1” to enable the CUDA profiler to gather information about your program.

Hope this helps,

Hey Mat,
thank you for the fast response!
I am using the pgi accelerator model. The way I am capturing the time is as following. I haved used the “-ta=nvidia,time” flag before on some test loops but have noticed that this slowes down the performance. I use the “call system clock” inbetween two accelerated loops, not in a compute region. Do you think this still can cause this problem?

!$acc region
!$acc end region

call sytem clock...

!$acc region
!$acc end region

Can I use "pgcollect -cuda " and “CUDA_PROFILE=1” even if I dont use CUDA?
Thank you!

Do you think this still can cause this problem?

Because the performance difference varies from routine to routine and run to run, it just seems like it.

Can I use "pgcollect -cuda " and “CUDA_PROFILE=1” even if I dont use CUDA?

Both work just take off the “-cuda” from pgcollect. The “-cuda” enables GPU hardware counters. We’re working on adding this support for the accelerator model, but for now you only get information about how much time each kernel takes, how much time was spent copying data, and initialization time. (It’s similar to the -ta=nvidia,time output).

The CUDA prof info is the same, just not aggregated.

  • Mat