Very slow performance of some loops

Hi

I am facing performance problems with my code. I have several subroutine calls in a main program, and all the subroutines are ported to the GPU. The time spent in the loops of the subroutines should all be on the order of 10^-3 to 10^-4 seconds. When I compile the program and run it several times, there are always three loops that are VERY slow (on the order of 1-2 seconds). The strange thing is that these times jump around: in one execution, loop3 in subroutine1 takes 2 s, whereas when I compile and execute the program again, loop6 in subroutine1 takes about 2 s…
Please check the following output:

 **********************************************************
 Time for Sub1,Loop1 [s]        =   8.2000000000000001E-005
 Time for Sub1,Loop2 [s]        =   2.0699999999999999E-004
 Time for Sub1,Loop3 [s]        =    2.107871000000000        !!!
 Time for Sub1,Loop4 [s]        =   1.5999999999999999E-004
 Time for Sub1,Loop5 [s]        =   8.0599999999999997E-004
 Time for Sub1,Loop6 [s]        =   8.3400000000000002E-003
 Time for Sub1,Loop7 [s]        =   2.4399999999999999E-004
 Time for Sub1,Loop8 [s]        =   1.0300000000000000E-004
 Time for Sub1,Loop9 [s]        =   8.2999999999999998E-005
 Time for Sub1,Loop10[s]        =   2.2551999999999999E-002
 **********************************************************
 Time for Sub2,Loop1 [s]        =   1.7500000000000000E-004
 Time for Sub2,Loop2 [s]        =   8.3999999999999995E-005
 Time for Sub2,Loop3 [s]        =   7.0999999999999991E-005
 Time for Sub2,Loop4 [s]        =   2.7799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3799999999999999E-004
 Time for Sub2,Loop6 [s]        =   6.3000000000000000E-005
 **********************************************************
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 **********************************************************
 Time for Sub4,Loop1 [s]        =   1.5999999999999999E-004
 Time for Sub4,Loop2 [s]        =   1.7799999999999999E-004
 Time for Sub4,Loop3 [s]        =    2.636425000000000        !!!
 **********************************************************
 
 **********************************************************
 Time for Sub1,Loop1 [s]        =   6.7000000000000002E-005
 Time for Sub1,Loop2 [s]        =   1.5233000000000000E-002
 Time for Sub1,Loop3 [s]        =   1.1600000000000000E-004
 Time for Sub1,Loop4 [s]        =   1.3799999999999999E-004
 Time for Sub1,Loop5 [s]        =   2.4163000000000000E-002
 Time for Sub1,Loop6 [s]        =   3.7700000000000000E-004
 Time for Sub1,Loop7 [s]        =   1.9899999999999999E-004
 Time for Sub1,Loop8 [s]        =   6.9999999999999994E-005
 Time for Sub1,Loop9 [s]        =   6.4999999999999994E-005
 Time for Sub1,Loop10[s]        =    1.978296000000000        !!!
 **********************************************************
 Time for Sub2,Loop1 [s]        =   1.6899999999999999E-004
 Time for Sub2,Loop2 [s]        =   7.4999999999999993E-005
 Time for Sub2,Loop3 [s]        =   7.3999999999999996E-005
 Time for Sub2,Loop4 [s]        =   2.5799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3500000000000000E-004
 Time for Sub2,Loop6 [s]        =   6.7999999999999999E-005
 **********************************************************
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 **********************************************************
 Time for Sub4,Loop1 [s]        =   8.0800000000000002E-004
 Time for Sub4,Loop2 [s]        =    1.442023000000000        !!!
 Time for Sub4,Loop3 [s]        =   8.7599999999999993E-004
 **********************************************************

Is it possible that, from time to time, the launch of a new kernel leads to a 1 to 2 second stall of the program? That the GPU stands still? To me it seems arbitrary which loop ends up being the slow one. I have also studied the compiler messages carefully, and there is no device-host data transfer during execution!
Thank you very much!!!

Hi elephant,

How are you timing the loops? If you are using a CPU timer, then it’s possible that the OS is interrupting your process, and this delay is being picked up by your timer.

If using the PGI Accelerator Model, add the “-ta=nvidia,time” option to get some basic profiling information about your kernels. For CUDA Fortran, try using the CUDA events to get more accurate timing information.

Example:

...
    integer :: istat
    real :: func_time, mytime
    type(cudaEvent) :: func_start, func_end
...
    istat = cudaEventCreate(func_start)
    istat = cudaEventCreate(func_end)
    istat = cudaEventRecord(func_start, 0)
    call mykernel<<<dimGrid,dimBlock>>>(dX, dY, dTemp, N)
    istat = cudaEventRecord(func_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(func_time, func_start, func_end)
    mytime = func_time/1000    ! cudaEventElapsedTime reports milliseconds
...

Also, try using a profiler to get a better understanding of what’s happening. You can use “pgcollect -cuda” to gather performance data and then view this data in the PGI profiler, pgprof. Or set the environment variable “CUDA_PROFILE=1” to enable the CUDA profiler to gather information about your program.
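For example, from the command line (“a.out” here is just a placeholder for your own executable name):

```shell
# Gather performance data with pgcollect, then browse it in pgprof
pgcollect -cuda a.out
pgprof -exe a.out

# Or enable the CUDA profiler via the environment; the results are
# typically written to a cuda_profile log file in the working directory
export CUDA_PROFILE=1
./a.out
```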

Hope this helps,
Mat

Hey Mat,
thank you for the fast response!
I am using the PGI Accelerator Model. The way I am capturing the time is as follows. I have used the “-ta=nvidia,time” flag before on some test loops but noticed that it slows down the performance. I call “system_clock” in between two accelerated loops, not inside a compute region. Do you think this can still cause the problem?

!$acc region
   loop1
!$acc end region

call system_clock(...)

!$acc region
   loop2
!$acc end region
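Spelled out a bit more, my timing code looks roughly like this (a sketch; the variable names are made up):

```fortran
integer :: count0, count1, count_rate
real    :: t_loop2

!$acc region
   ! loop1 ...
!$acc end region

! take a tick count on the host between the two accelerated loops
call system_clock(count0, count_rate)

!$acc region
   ! loop2 ...
!$acc end region

call system_clock(count1)
t_loop2 = real(count1 - count0) / real(count_rate)   ! seconds
```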

Can I use “pgcollect -cuda” and “CUDA_PROFILE=1” even if I don’t use CUDA?
Thank you!

Do you think this can still cause the problem?

Because the performance difference varies from routine to routine and from run to run, it certainly seems like it.

Can I use “pgcollect -cuda” and “CUDA_PROFILE=1” even if I don’t use CUDA?

Both work; just take off the “-cuda” from pgcollect. The “-cuda” option enables GPU hardware counters. We’re working on adding this support for the Accelerator Model, but for now you only get information about how much time each kernel takes, how much time was spent copying data, and the initialization time. (It’s similar to the -ta=nvidia,time output.)

The CUDA prof info is the same, just not aggregated.

Mat