What is the difference between "GPU activities" and "API calls"?

When I used pgprof to profile my program, I got the following error:

#:~/CFL3D$ pgprof ./cfl3d_seq <HSCM3_fine.inp 
==32521== PGPROF is profiling process 32521, command: ./cfl3d_seq
==32521== Profiling application: ./cfl3d_seq
==32521== Profiling result:
No kernels were profiled.
No API activities were profiled.
==32521== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

So I used nvprof instead. Here is the profiling result. I want to know: is the total running time on the GPU the sum of the two sections (API calls + GPU activities)? Or do the "API calls" include the "GPU activities"? And how can I reduce the time spent in "cudaFree" and "cudaMalloc" under "API calls"?

==26402== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.09%  20.4365s     52224  391.32us     832ns  19.270ms  [CUDA memcpy HtoD]
                   23.74%  8.96840s      4032  2.2243ms  50.080us  18.944ms  [CUDA memcpy DtoH]
                    5.65%  2.13310s       192  11.110ms  8.5115ms  14.219ms  twokernel_do2335_2_
                    5.58%  2.10966s       192  10.988ms  8.4016ms  14.030ms  twokernel_do2333_2_
                    5.36%  2.02546s       192  10.549ms  8.0914ms  13.581ms  twokernel_do2334_2_
                    0.62%  233.64ms       192  1.2169ms  929.29us  1.6044ms  twokernel_do893_
                    ... ...
                    0.01%  4.1183ms       192  21.449us  20.864us  22.113us  diagjkernel_do7016_
      API calls:   72.09%  35.0658s     56256  623.32us  8.6680us  20.016ms  cudaMemcpy
                   22.30%  10.8480s     52224  207.72us  4.8030us  14.229ms  cudaFree
                    5.07%  2.46781s     52224  47.254us  5.4810us  517.61ms  cudaMalloc
                    0.53%  257.91ms     10560  24.423us  8.0070us  568.38us  cudaLaunchKernel
                    0.00%  805.97us         1  805.97us  805.97us  805.97us  cuDeviceTotalMem
                    0.00%  547.29us        96  5.7000us     256ns  214.70us  cuDeviceGetAttribute
                    0.00%  67.739us         1  67.739us  67.739us  67.739us  cuDeviceGetName
                    0.00%  7.1660us         1  7.1660us  7.1660us  7.1660us  cuDeviceGetPCIBusId
                    0.00%  3.4950us         3  1.1650us     345ns  1.9350us  cuDeviceGetCount
                    0.00%  2.4870us         1  2.4870us  2.4870us  2.4870us  cuDriverGetVersion
                    0.00%  1.5180us         2     759ns     350ns  1.1680us  cuDeviceGet
                    0.00%     529ns         1     529ns     529ns     529ns  cuDeviceGetUuid

Hi xll_blt,

When I used pgprof to profile my program, I got the following error:

What CUDA driver do you have installed? It seems that a recent update to the CUDA drivers has disabled user-level profiling due to a potential security risk. See: https://nvidia.custhelp.com/app/answers/detail/a_id/4738

With the authorized workaround at: https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters

Though since you’re able to run with nvprof, which is also affected, this may be a different issue, where you need to compile your OpenACC program with the same CUDA version as the driver (e.g. “-ta=tesla:cuda10.1”). Note that nvprof and pgprof are the same program with a few different defaults, though possibly at different versions depending on which PGI release you’re using.

Or “API calls” include “GPU activities”?

Yes, an API call may include some of the GPU activities. For example, cudaMemcpy takes 35.07 seconds, which is inclusive of the roughly 29.4 seconds of transfer time shown in the GPU activities section (20.44 s HtoD plus 8.97 s DtoH). The remaining ~5.7 seconds is most likely overhead due to the large number of calls.

And how to reduce the time of “cudaFree” and “cudaMalloc” in “API calls”?

I’m assuming you’re using OpenACC? If so, then these will be associated with the OpenACC data regions in your code. Hence to reduce this overhead, you’ll want to move your data regions earlier in the code so data isn’t being allocated and deleted on the device over and over again.

If you aren’t using explicit data regions, you should consider adding them. Compute regions (“parallel” or “kernels”) have an implicit data region, so data would be created/deleted each time the code enters the region.
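To illustrate, here’s a minimal sketch (assuming OpenACC; the array names and loop bounds are just placeholders for your own code). Without an explicit data region, every trip through the outer loop allocates and frees the arrays on the device, which shows up as the repeated cudaMalloc/cudaFree calls in the profile:

```fortran
! Implicit data region: device copies of a and b are created and
! destroyed on every entry/exit of the compute region.
do istep = 1, nstep
   !$acc parallel loop copy(a) copyin(b)
   do i = 1, n
      a(i) = a(i) + b(i)
   end do
end do

! Hoisted structured data region: one allocation and one free total;
! the compute regions reuse the device copies via present().
!$acc data copy(a) copyin(b)
do istep = 1, nstep
   !$acc parallel loop present(a, b)
   do i = 1, n
      a(i) = a(i) + b(i)
   end do
end do
!$acc end data
```

The second form also avoids the per-iteration host/device copies implied by the copy clauses, so it typically reduces the cudaMemcpy time as well.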

-Mat

Hi Mat,
Here is the version I am using:
Driver Version: 418.67, CUDA Version: 10.1, PGI Version: 19.4
This is a CUDA Fortran program and the arrays I used on the GPU are automatic arrays. Besides, when I combined several small arrays into one large array, the time of “cudaFree” and “cudaMalloc” reduced, but the time of “cudaMemcpy” increased and the total running time of the program increased.

Driver Version: 418.67 CUDA Version

I had the same profiling issue on a system with this driver, and applying the above workaround fixed it.

This is a CUDA Fortran program and the arrays I used on the GPU are automatic arrays.

I’m assuming the automatic array is on the host side rather than in your kernel?

when I combined several small arrays into one large array, the time of “cudaFree” and “cudaMalloc” reduced, but the time of “cudaMemcpy” increased and the total running time of the program increased.

Are you copying the data as one large contiguous chunk, or as many smaller chunks? Larger contiguous chunks are usually better.
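For example (a CUDA Fortran sketch; `a_h`/`a_d` and the block count are illustrative), each small transfer pays the fixed per-call cost of cudaMemcpy, while a contiguous array can move in a single large copy:

```fortran
! Many small transfers: one cudaMemcpy per block, each paying the
! per-call launch/latency overhead.
do k = 1, nblocks
   a_d(:,k) = a_h(:,k)
end do

! If the host data is contiguous, one assignment maps to a single
! large cudaMemcpy, which usually achieves much higher bandwidth.
a_d = a_h
```

With 52,224 HtoD copies in your profile, even a small per-call overhead adds up, so batching transfers where the layout allows can help.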

Are the host arrays that you copy to/from the device also automatic? If not, you may want to add the “pinned” attribute to the host arrays. Though pinned memory comes with a high allocation cost, so it would not be beneficial for automatics.
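In CUDA Fortran that looks something like the following sketch (names illustrative): the “pinned” attribute applies to allocatable host arrays, giving page-locked memory that transfers faster, provided the array is allocated once and reused rather than re-allocated each call:

```fortran
! Pinned (page-locked) host array: faster DMA transfers, but a more
! expensive allocation -- allocate once up front and reuse.
real, pinned, allocatable :: a_h(:)
real, device, allocatable :: a_d(:)

allocate(a_h(n), a_d(n))
a_d = a_h        ! HtoD copy from pinned host memory
```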

-Mat