Low GPU-Load

Hi

I’m using OpenACC Fortran (PGI compiler 15.4) with an NVIDIA GPU on Windows.

The utility “nvidia-smi.exe” shows a rather low GPU load (about 27% in the example below).

+------------------------------------------------------+
| NVIDIA-SMI 340.66     Driver Version: 340.66         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2000        TCC  | 0000:0A:00.0     Off |                  N/A |
| 30%   40C    P0    N/A /  N/A |     60MiB /  2047MiB |     27%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      8044  ..._trunk\builds\de\pgi\release\myprog.exe    53MiB |
+-----------------------------------------------------------------------------+

Which tools should I use to find out why this load is so low?

I tried pgcollect as a preprocessor for pgprof, but I get the error message “pgi-prof-win32: LICENSE MANAGER PROBLEM: License server does not support this feature.”

Benedikt

Sorry. I found the LICENSE MANAGER PROBLEM on my own.

Will give pgcollect/pgprof a try…

Which tools should I use to find out why this load is so low?

For this issue, I would look at the output from the command line profilers. Try setting the environment variable “PGI_ACC_TIME=1” and look at how many gangs are being scheduled. If it’s low, then the problem is that the data set you’re using isn’t large enough to fully saturate the system.
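For example (a minimal sketch; bash syntax shown, with the Windows cmd equivalent in the comment, and “myprog” standing in for your executable):

```shell
# Enable PGI's built-in accelerator profiling for this run.
# Windows cmd equivalent: set PGI_ACC_TIME=1
export PGI_ACC_TIME=1

# Run the program; per-kernel timings and launch configurations
# ("grid: [...]  block: [...]") are printed when it exits.
./myprog
```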

Also, try NVIDIA’s command-line profiler by setting the environment variable “COMPUTE_PROFILE=1”. After you run the program, a “cuda_profile” text file will be generated. Look at the “occupancy” of each kernel. If it’s low, compile your program with the flag “-ta=tesla:ptxinfo” to see how many registers and how much shared memory each kernel uses. The more registers used per thread, the lower the kernel’s occupancy. You can then add the flag “-ta=tesla:maxregcount:n” to lower the maximum number of registers used and increase the occupancy. The caveat is that with fewer registers, local variables will “spill” to the L1 cache, and when that runs out, to global memory, causing a performance slowdown. Ideally, lower the register count just to the point where occupancy increases but before spills reach global memory.
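That workflow might look like this (a sketch, not a definitive recipe: the source and output file names are placeholders, and n=32 is just an example register cap):

```shell
# Enable NVIDIA's command-line profiler; the next run writes a
# "cuda_profile" log -- look for the "occupancy" lines in it.
export COMPUTE_PROFILE=1

# Recompile with PTX info to see register/shared-memory use per kernel.
pgfortran -acc -ta=tesla:ptxinfo -o myprog myprog.f90

# If occupancy is low, cap the register count and re-measure.
# Fewer registers can raise occupancy, but may spill locals to
# L1 cache and then global memory, so compare both runs.
pgfortran -acc -ta=tesla:maxregcount:32 -o myprog myprog.f90
```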

A third option is to use NVVP, NVIDIA’s Visual Profiler. It has some built-in analytics which can help diagnose where the issue is.

Hope this helps,
Mat

Mat, I thought the GPU load is low because there’s too much CPU code between the GPU code. You suggest thinking about kernels, gangs, etc.

OK. Probably you are right, but I’d really like to know how much time is spent on CPU/GPU.

I’ve worked with PGI_ACC_TIME from time to time, but so far I never looked at “how many gangs are being scheduled”.

Here’s part of the output. Where do I have to look?

U:/ws/hydroas_trunk/AS\Hydro_AS-2D_2dm.inc
  predict  NVIDIA  devicenum=0
    time(us): 0
    1: data region reached 836 times
    1838: compute region reached 836 times
        1843: kernel launched 836 times
            grid: [16767]  block: [128]
            elapsed time(us): total=1,815,000 max=1,015,000 min=0 avg=2,171
        1865: kernel launched 828 times
            grid: [1-3]  block: [128]
            elapsed time(us): total=201,000 max=16,000 min=0 avg=242
        1865: reduction kernel launched 828 times
            grid: [1]  block: [256]
            elapsed time(us): total=248,000 max=16,000 min=0 avg=299
        2104: kernel launched 836 times
            grid: [1]  block: [1]
            elapsed time(us): total=201,000 max=16,000 min=0 avg=240
    2141: compute region reached 836 times
        2147: kernel launched 836 times
            grid: [3-4]  block: [128]
            elapsed time(us): total=124,000 max=16,000 min=0 avg=148
U:/ws/hydroas_trunk/AS\Hydro_AS-2D_2dm.inc
  ucvc  NVIDIA  devicenum=0
    time(us): 0
    1: data region reached 836 times
    2275: compute region reached 836 times
        2281: kernel launched 836 times
            grid: [3-4]  block: [128]
            elapsed time(us): total=141,000 max=16,000 min=0 avg=168
        2301: kernel launched 836 times
            grid: [11059]  block: [128]
            elapsed time(us): total=405,000 max=16,000 min=0 avg=484
        2302: kernel launched 836 times
            grid: [11059]  block: [128]
            elapsed time(us): total=1,562,000 max=1,016,000 min=0 avg=1,868
        2306: kernel launched 828 times
            grid: [1-2]  block: [128]
            elapsed time(us): total=170,000 max=16,000 min=0 avg=205
        2352: kernel launched 816 times
            grid: [1]  block: [128]
            elapsed time(us): total=142,000 max=16,000 min=0 avg=174

This output was generated for a test run where I suppose the loops are too short to be really efficient on the GPU. But how can I see that from this information?

Benedikt

Hi Benedikt,

Are you running on Windows? Your PGI_ACC_TIME results are odd in that you have “min” values of zero and a very high “max” time. On Windows, setting both “PGI_ACC_TIME” and “COMPUTE_PROFILE” is required due to a problem with the interface to the CUDA driver.
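On Windows cmd, that combination would be (with “myprog.exe” standing in for your executable):

```shell
:: Windows cmd: both variables are needed for correct PGI_ACC_TIME output
set PGI_ACC_TIME=1
set COMPUTE_PROFILE=1
myprog.exe
```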

but I’d really like to know how much time is spent on CPU/GPU.

There’s not really a good option for this on Windows. What might work best is a combination of NVVP and PGPROF. NVVP will show GPU performance as well as where the CPU code is running. You can then run a CPU only version, compiled with “-Mprof=lines”, under PGPROF to see where the CPU time is being spent.
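As a sketch of that two-step approach (the file names are placeholders; check the PGI documentation for the exact pgprof invocation on your setup):

```shell
# Build a CPU-only version with line-level profiling instrumentation
# (no -acc, so all code runs on the host).
pgfortran -Mprof=lines -o myprog_cpu myprog.f90

# Run it; a pgprof.out trace file is written to the working directory.
./myprog_cpu

# Inspect per-line CPU time in PGPROF.
pgprof -exe myprog_cpu pgprof.out
```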

On Linux, you’d have more options, including VampirTrace.

Mat