Profiling OpenACC

I’m trying to profile an OpenACC kernel and get the following output when PGI_ACC_TIME=1 is set:

Accelerator Kernel Timing data
  _Z7openACCifPKfS0_fPfRKi  NVIDIA  devicenum=0
    time(us): 5,372,424
    142: compute region reached 1000 times
        148: kernel launched 1000 times
            grid: [8x1024]  block: [128]
             device time(us): total=3,433,501 max=2,131 min=181 avg=3,433
            elapsed time(us): total=1,847,519 max=2,148 min=1,779 avg=1,847
    142: data region reached 2000 times
        142: data copyin transfers: 2000
             device time(us): total=1,296,938 max=672 min=646 avg=648
        157: data copyout transfers: 1000
             device time(us): total=641,985 max=673 min=639 avg=641

The issue is that the average kernel time is higher than the max. The average is obviously the total divided by the number of launches, in this case 1000, so it seems that something else is being included in the total.

Is this working correctly?

Hi mnicely,

Another oddity is that your elapsed time is less than your device time. Elapsed time is measured from the host and is inclusive of the device time, so it shouldn’t be less.

Are you using “async”? If so, you should have seen a warning but I don’t see one here. Something like:

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses

The command-line profiler can give odd results when asynchronous behavior is enabled. Try disabling it by setting the environment variable “PGI_ACC_SYNCHRONOUS=1”, or instead use PGPROF or NVPROF to profile.
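
For example, in a bash-like shell (“a.out” here just stands in for your executable):

% export PGI_ACC_SYNCHRONOUS=1
% ./a.out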

Is this working correctly?

If you’re not using async, then no, but I’ve no idea what the problem is.

Can you post a reproducing example and give details about the compiler version you’re using, the command-line options you’re using to compile, the system OS, the GPU, and which CUDA driver you have?

-Mat

Hey Mat,

Sorry for the delayed reply. I didn’t receive an email notification telling me you replied. The code is at https://github.com/mnicely/computeWorks_examples/tree/master/computeWorks_mm.

Not using async, just simple matrix multiplication.
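
The OpenACC part is just a triple-nested loop that the benchmark launches 1000 times, roughly of this shape (a simplified sketch with placeholder names, not the exact source; see the repo for the real code):

// Illustrative sketch only -- not the actual computeWorks_mm source.
void openACC( const int n, const float *A, const float *B, float *C )
{
#pragma acc parallel loop copyin( A[0:n*n], B[0:n*n] ) copyout( C[0:n*n] )
    for ( int i = 0; i < n; i++ ) {
#pragma acc loop
        for ( int j = 0; j < n; j++ ) {
            float sum = 0.0f;
            for ( int k = 0; k < n; k++ )
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}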

Matt

Hi Matt,

Thanks for the example. I was able to recreate the issue and think I have a basic understanding of what’s happening.

With this example, you’re using both OpenACC and CUDA C in the same file. “PGI_ACC_TIME” is just a quick profiler for the OpenACC code: it measures the “elapsed time” with host-side timers and the “device time” through the NVIDIA profiling library “libcupti.so”. For some reason, when the CUDA kernel is launched after the OpenACC-generated kernel, the device time of the two gets merged, which is why the elapsed time ends up less than the device time. Calling the CUDA kernel before the OpenACC kernel gives the expected result.
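
In other words, if the benchmark driver is shaped like the hypothetical sketch below (placeholder names, not the repo’s actual code), running the CUDA benchmark before the OpenACC one avoids the merge:

// Hypothetical driver ordering -- placeholder functions, not the repo's code.
void runOpenACC( int n );   // OpenACC matrix multiply benchmark
void runCuda( int n );      // hand-written CUDA C matrix multiply benchmark

int main( void )
{
    const int n = 1024;
    runCuda( n );      // workaround: launch the CUDA kernel first ...
    runOpenACC( n );   // ... so its device time isn't folded into the
                       // OpenACC kernel's PGI_ACC_TIME total
    return 0;
}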

However, for this case I’d recommend using nvprof or pgprof so you can get the kernel times for both the OpenACC and CUDA code.

% pgprof computeWorks_mm
No input given.
Running with N = 1024

Running Normal C: 3406.21 ms
Running OpenMP: 550.61 ms
Test passed.
Running BLAS: 5.31 ms
Test passed.
==182347== PGPROF is profiling process 182347, command: computeWorks_mm
Running OpenACC: Test passed.
Running CUBLAS: 0.18 ms
Test passed.
Running CUDA: 1.11 ms
Test passed.
==182347== Profiling application: computeWorks_mm
==182347== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   40.17%  1.53950s      1000  1.5395ms  1.5186ms  1.7032ms  openACC_169_gpu(int, float, float const *, float const *, float, float*, int const &)
                   28.82%  1.10439s      1000  1.1044ms  1.0897ms  1.1240ms  cudaKernel(int, float const *, float const *, float*)
                   17.98%  688.97ms      2005  343.63us  1.7600us  941.62us  [CUDA memcpy HtoD]
                    8.40%  322.10ms      1002  321.45us  318.84us  881.94us  [CUDA memcpy DtoH]
                    4.60%  176.45ms      1000  176.45us  173.50us  191.01us  volta_sgemm_128x32_nn
                    0.03%  1.2014ms      1000  1.2010us  1.1520us  1.5360us  [CUDA memset]
... continues ...



Sorry for the delayed reply. I didn’t receive an email notification telling me you replied.

No worries. Our user forum doesn’t send out email notifications.

-Mat

Hey Mat,

That all makes sense. Thanks for looking into it.

Funny you mentioned nvprof. When I run the following:

nvprof --print-openacc-summary ./computeWorks_mm 128

or

pgprof --print-openacc-summary ./computeWorks_mm 128

I get the following output:

Running with N = 128

Running Normal C: 2.38 ms
Running OpenMP: 0.90 ms
Test passed.
Running BLAS: 0.04 ms
Test passed.
==13908== NVPROF is profiling process 13908, command: ./computeWorks_mm 128
==13908== Profiling application: ./computeWorks_mm 128
==13908== Profiling result:
No OpenACC activities were profiled.
==13908== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

You mentioned in another post that you needed to run under “sudo”. Is this also being run under “sudo”?

I ask because this is the error that I sometimes see when I don’t have permissions to run the device profiler, libcupti.so.

-Mat

No. I’m not running nvprof with sudo.

I wonder if this is related to a CUDA driver update where users don’t have profiling permission by default.

See: https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters

I have one system where I have this problem and am checking with our IT folks to see if they can update the permissions.
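
If it does turn out to be that permission restriction, the fix described on that page is to allow non-admin access to the GPU performance counters by setting the nvidia kernel module option “NVreg_RestrictProfilingToAdminUsers” to 0, for example (the .conf file name here is just an example and paths can vary by distribution):

% cat /etc/modprobe.d/nvidia-profiling.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0

and then reboot (or reload the nvidia module) so the option takes effect.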