Profiling OpenACC

I’m trying to profile an OpenACC kernel and get the following output when PGI_ACC_TIME=1 is set:

Accelerator Kernel Timing data
  _Z7openACCifPKfS0_fPfRKi  NVIDIA  devicenum=0
    time(us): 5,372,424
    142: compute region reached 1000 times
        148: kernel launched 1000 times
            grid: [8x1024]  block: [128]
             device time(us): total=3,433,501 max=2,131 min=181 avg=3,433
            elapsed time(us): total=1,847,519 max=2,148 min=1,779 avg=1,847
    142: data region reached 2000 times
        142: data copyin transfers: 2000
             device time(us): total=1,296,938 max=672 min=646 avg=648
        157: data copyout transfers: 1000
             device time(us): total=641,985 max=673 min=639 avg=641

The issue is that the average kernel time is higher than the max. The average is obviously the total divided by the number of launches, in this case 1000, so it seems that something else is being included in the total.

Is this working correctly?

Hi mnicely,

Another oddity is that your elapsed time is less than your device time. Elapsed time is measured from the host and is inclusive of the device time, so it shouldn’t be less.

Are you using “async”? If so, you should have seen a warning but I don’t see one here. Something like:

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses

The command-line profiler can give odd results when asynchronous behavior is enabled. Try disabling it by setting the environment variable “PGI_ACC_SYNCHRONOUS=1”, or instead use PGPROF or NVPROF to profile.
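
For example, in a bash-like shell (“a.out” here just stands in for your executable):

% export PGI_ACC_SYNCHRONOUS=1
% ./a.out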

Is this working correctly?

If you’re not using async, then no, but I’ve no idea what the problem is.

Can you post a reproducing example and give details about the compiler version you’re using, the command-line options you’re using to compile, the system OS, the GPU, and which CUDA driver you have?

-Mat

Hey Mat,

Sorry for the delayed reply. I didn’t receive an email notification telling me you replied. The code is at https://github.com/mnicely/computeWorks_examples/tree/master/computeWorks_mm.

Not using async, just simple matrix multiplication.
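
The OpenACC part is just a triple-nested loop that the benchmark launches 1000 times, roughly of this shape (a simplified sketch with placeholder names, not the exact source; see the repo for the real code):

// Illustrative sketch only -- not the actual computeWorks_mm source.
void openACC( const int n, const float *A, const float *B, float *C )
{
#pragma acc parallel loop copyin( A[0:n*n], B[0:n*n] ) copyout( C[0:n*n] )
    for ( int i = 0; i < n; i++ ) {
#pragma acc loop
        for ( int j = 0; j < n; j++ ) {
            float sum = 0.0f;
            for ( int k = 0; k < n; k++ )
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}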

Matt

Hi Matt,

Thanks for the example. I was able to recreate the issue and think I have a basic understanding of what’s happening.

With this example, you’re using both OpenACC and CUDA C in the same file. “PGI_ACC_TIME” is just a quick profiler for the OpenACC code: it measures the “elapsed time” with host-side timers and the “device time” through the NVIDIA profiling library “libcupti.so”. For some reason, when the CUDA kernel is launched after the OpenACC-generated kernel, the device time of the two gets merged, which is why the elapsed time ends up less than the device time. Calling the CUDA kernel before the OpenACC kernel gives the expected result.
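
In other words, if the benchmark driver is shaped like the hypothetical sketch below (placeholder names, not the repo’s actual code), running the CUDA benchmark before the OpenACC one avoids the merge:

// Hypothetical driver ordering -- placeholder functions, not the repo's code.
void runOpenACC( int n );   // OpenACC matrix multiply benchmark
void runCuda( int n );      // hand-written CUDA C matrix multiply benchmark

int main( void )
{
    const int n = 1024;
    runCuda( n );      // workaround: launch the CUDA kernel first ...
    runOpenACC( n );   // ... so its device time isn't folded into the
                       // OpenACC kernel's PGI_ACC_TIME total
    return 0;
}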

However, for this case I’d recommend using nvprof or pgprof so you can get the kernel times for both the OpenACC and CUDA code.

% pgprof computeWorks_mm
No input given.
Running with N = 1024

Running Normal C: 3406.21 ms
Running OpenMP: 550.61 ms
Test passed.
Running BLAS: 5.31 ms
Test passed.
==182347== PGPROF is profiling process 182347, command: computeWorks_mm
Running OpenACC: Test passed.
Running CUBLAS: 0.18 ms
Test passed.
Running CUDA: 1.11 ms
Test passed.
==182347== Profiling application: computeWorks_mm
==182347== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   40.17%  1.53950s      1000  1.5395ms  1.5186ms  1.7032ms  openACC_169_gpu(int, float, float const *, float const *, float, float*, int const &)
                   28.82%  1.10439s      1000  1.1044ms  1.0897ms  1.1240ms  cudaKernel(int, float const *, float const *, float*)
                   17.98%  688.97ms      2005  343.63us  1.7600us  941.62us  [CUDA memcpy HtoD]
                    8.40%  322.10ms      1002  321.45us  318.84us  881.94us  [CUDA memcpy DtoH]
                    4.60%  176.45ms      1000  176.45us  173.50us  191.01us  volta_sgemm_128x32_nn
                    0.03%  1.2014ms      1000  1.2010us  1.1520us  1.5360us  [CUDA memset]
... continues ...



Sorry for the delayed reply. I didn’t receive an email notification telling me you replied.

No worries. Our user forum doesn’t send out email notifications.

-Mat

Hey Mat,

That all makes sense. Thanks for looking into it.

Funny you mentioned nvprof. When I run the following:

nvprof --print-openacc-summary ./computeWorks_mm 128

or

pgprof --print-openacc-summary ./computeWorks_mm 128

I get the following output:

Running with N = 128

Running Normal C: 2.38 ms
Running OpenMP: 0.90 ms
Test passed.
Running BLAS: 0.04 ms
Test passed.
==13908== NVPROF is profiling process 13908, command: ./computeWorks_mm 128
==13908== Profiling application: ./computeWorks_mm 128
==13908== Profiling result:
No OpenACC activities were profiled.
==13908== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

You mentioned in another post that you needed to run under “sudo”. Is this also being run under “sudo”?

I ask because this is the error that I sometimes see when I don’t have permissions to run the device profiler, libcupti.so.

-Mat

No. I’m not running nvprof with sudo.

I wonder if this is related to a CUDA driver update where users don’t have profiling permission by default.

See: https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters

I have one system where I have this problem and am checking with our IT folks to see if they can update the permissions.
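
If it does turn out to be that permission restriction, the fix described on that page is to allow non-admin access to the GPU performance counters by setting the nvidia kernel module option “NVreg_RestrictProfilingToAdminUsers” to 0, for example (the .conf file name here is just an example and paths can vary by distribution):

% cat /etc/modprobe.d/nvidia-profiling.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0

and then reboot (or reload the nvidia module) so the option takes effect.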