I’m trying to profile an OpenACC kernel and get the following output with PGI_ACC_TIME=1 set:
Accelerator Kernel Timing data
_Z7openACCifPKfS0_fPfRKi  NVIDIA  devicenum=0
    time(us): 5,372,424
    142: compute region reached 1000 times
        148: kernel launched 1000 times
            grid: [8x1024]  block: [128]
             device time(us): total=3,433,501 max=2,131 min=181 avg=3,433
            elapsed time(us): total=1,847,519 max=2,148 min=1,779 avg=1,847
    142: data region reached 2000 times
        142: data copyin transfers: 2000
             device time(us): total=1,296,938 max=672 min=646 avg=648
        157: data copyout transfers: 1000
             device time(us): total=641,985 max=673 min=639 avg=641
The issue is that the average kernel time is higher than the max. The average is clearly the total divided by the number of launches, in this case 1000, so it seems that something else is being included in the total.
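Working the numbers from the report above:

    total / launches = 3,433,501 us / 1000 = 3,433 us      (matches the reported avg)
    launches * max   = 1000 * 2,131 us     = 2,131,000 us  (well below the reported total)

So 1000 launches, each bounded by the reported max, can’t account for the total by themselves.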
Is this working correctly?
Hi mnicely,
Another oddity is that your elapsed time is less than your device time. Elapsed time is measured from the host and is inclusive of the device time, so it shouldn’t be less.
Are you using “async”? If so, you should have seen a warning but I don’t see one here. Something like:
Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
The command-line profiler can give odd results when asynchronous behavior is enabled. Try disabling it by setting the environment variable “PGI_ACC_SYNCHRONOUS=1”, or instead use PGPROF or NVPROF to profile.
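For example, assuming a bash-like shell (“your_app” below is just a placeholder for your binary):

% export PGI_ACC_SYNCHRONOUS=1
% ./your_app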
Is this working correctly?
If you’re not using async, then no, this isn’t working correctly, though I don’t yet know what the problem is.
Can you post a reproducing example and give details about the compiler version you’re using, the command-line options you’re using to compile, the system OS, the GPU, and which CUDA driver you have?
-Mat
Hey Mat,
Sorry for the delayed reply. I didn’t receive an email notification telling me you replied. The code is at https://github.com/mnicely/computeWorks_examples/tree/master/computeWorks_mm.
Not using async, just simple matrix multiplication.
Matt
Hi Matt,
Thanks for the example. I was able to recreate the issue and think I have a basic understanding of what’s happening.
With this example, you’re using both OpenACC and CUDA C in the same file. “PGI_ACC_TIME” is just a quick profiler for the OpenACC code: it uses host-side timers for the “elapsed time” and calls the NVIDIA profiling library, “libcupti.so” (CUPTI), for the “device time”. For some reason, when the CUDA kernel is launched after the OpenACC-generated kernel, the device times of the two kernels get merged into the OpenACC kernel’s total. That inflated device total is why it exceeds the elapsed time. Calling the CUDA kernel before the OpenACC kernel gives the expected result.
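As a minimal sketch of the workaround (the function names here are illustrative placeholders, not the actual routines in computeWorks_mm):

    /* Illustrative call-order sketch; run_cuda_version() and
       run_openacc_version() are placeholder names, not the actual
       computeWorks_mm routines. */
    static void run_cuda_version(void)    { /* CUDA C kernel launch here */ }
    static void run_openacc_version(void) { /* OpenACC compute region here */ }

    int main(void)
    {
        run_cuda_version();     /* launch the CUDA kernel first... */
        run_openacc_version();  /* ...then the OpenACC kernel, so its
                                   "device time" isn't merged with the
                                   CUDA kernel's time */
        return 0;
    }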
However, for this case I’d recommend using nvprof or pgprof so you can get the kernel times for both the OpenACC and CUDA code.
% pgprof computeWorks_mm
No input given.
Running with N = 1024
Running Normal C: 3406.21 ms
Running OpenMP: 550.61 ms
Test passed.
Running BLAS: 5.31 ms
Test passed.
==182347== PGPROF is profiling process 182347, command: computeWorks_mm
Running OpenACC: Test passed.
Running CUBLAS: 0.18 ms
Test passed.
Running CUDA: 1.11 ms
Test passed.
==182347== Profiling application: computeWorks_mm
==182347== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   40.17%  1.53950s   1000  1.5395ms  1.5186ms  1.7032ms  openACC_169_gpu(int, float, float const *, float const *, float, float*, int const &)
                   28.82%  1.10439s   1000  1.1044ms  1.0897ms  1.1240ms  cudaKernel(int, float const *, float const *, float*)
                   17.98%  688.97ms   2005  343.63us  1.7600us  941.62us  [CUDA memcpy HtoD]
                    8.40%  322.10ms   1002  321.45us  318.84us  881.94us  [CUDA memcpy DtoH]
                    4.60%  176.45ms   1000  176.45us  173.50us  191.01us  volta_sgemm_128x32_nn
                    0.03%  1.2014ms   1000  1.2010us  1.1520us  1.5360us  [CUDA memset]
... continues ...
Sorry for the delayed reply. I didn’t receive an email notification telling me you replied.
No worries. Our user forum doesn’t send out email notifications.
-Mat
Hey Mat,
That all makes sense. Thanks for looking into it.
Funny you mention nvprof. When I run the following:
nvprof --print-openacc-summary ./computeWorks_mm 128
or
pgprof --print-openacc-summary ./computeWorks_mm 128
I get the following output:
Running with N = 128
Running Normal C: 2.38 ms
Running OpenMP: 0.90 ms
Test passed.
Running BLAS: 0.04 ms
Test passed.
==13908== NVPROF is profiling process 13908, command: ./computeWorks_mm 128
==13908== Profiling application: ./computeWorks_mm 128
==13908== Profiling result:
No OpenACC activities were profiled.
==13908== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
You mentioned in another post that you needed to run under “sudo”. Is this also being run under “sudo”?
I ask because this is the error that I sometimes see when I don’t have permissions to run the device profiler, libcupti.so.
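As an aside, the “flush profile data” warning in your output can be addressed by calling cudaProfilerStop() before the application exits; a minimal sketch (built with nvcc):

    /* Minimal sketch: explicitly flush profiling data before exit,
       as the nvprof warning suggests. */
    #include <cuda_profiler_api.h>

    int main(void)
    {
        /* ... launch kernels, do memcpys, etc. ... */
        cudaProfilerStop();  /* flush profile data before the app exits */
        return 0;
    }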
-Mat
No. I’m not running nvprof with sudo.
I wonder if this is related to a CUDA driver update where users don’t have profiling permission by default.
See: https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters
I have one system where I see this problem and am checking with our IT folks to see if they’ll update the permissions.
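If it does turn out to be the permission issue, the fix described on that page for Linux is to allow non-admin access to the GPU performance counters and reboot; for example (the .conf filename below is arbitrary):

% sudo sh -c 'echo "options nvidia NVreg_RestrictProfilingToAdminUsers=0" > /etc/modprobe.d/nvidia-profiling.conf'
% sudo reboot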