I am working on matrix-vector multiplication algorithm development on a Jetson Xavier AGX. I want to collect timing information using clock_gettime, as shown in the following code:
Timings timings(1000, 0);
for (long long samples = -1; samples < 100000; samples++) {
    struct timespec before;
    clock_gettime(CLOCK_REALTIME, &before);

    cudaMemcpy(d_y, d_BD, sizeBD, cudaMemcpyDeviceToDevice);
    beta = beta * 1.01; // Just vary the input a little

    // Calculate: y = (alpha*A) * x + (beta*y)
    cublasDgemv(handle, CUBLAS_OP_N, M, K, &alpha, d_AC, M, d_x, incx, &beta, d_y, incy);

    // Store next state vector
    cudaMemcpy(d_x, d_y, sizeX, cudaMemcpyDeviceToDevice);

    struct timespec after;
    clock_gettime(CLOCK_REALTIME, &after);

    // Copy back to the host (outside the timed region)
    cudaMemcpy(h_y, d_y, sizeBD, cudaMemcpyDeviceToHost);

    if (samples >= 0) { // samples == -1 is a warm-up iteration, not recorded
        timings.append(before, after);
    }
}
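For reference, Timings is a small helper class I use to accumulate the samples. A simplified sketch of what it does (the real class has more bookkeeping, e.g. it also tracks min, stddev, and which iteration produced the max; the constructor arguments are sizing/configuration details omitted here) looks roughly like this:

#include <time.h>
#include <vector>

class Timings {
public:
    // Sketch only: first argument treated as a capacity hint, second unused.
    Timings(size_t reserve, int /*unused in this sketch*/) {
        samples_.reserve(reserve);
    }

    // Store the elapsed wall-clock time between two clock_gettime()
    // readings, converted to microseconds.
    void append(const struct timespec &before, const struct timespec &after) {
        double us = (after.tv_sec  - before.tv_sec)  * 1e6 +
                    (after.tv_nsec - before.tv_nsec) * 1e-3;
        samples_.push_back(us);
    }

    // Mean of all recorded samples, in microseconds.
    double average() const {
        double sum = 0.0;
        for (double s : samples_) sum += s;
        return samples_.empty() ? 0.0 : sum / samples_.size();
    }

private:
    std::vector<double> samples_;
};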
Using clock_gettime to time each iteration, I get:
Tests: 100000
average: 46.3 us
max: 8929.3 us ← (iteration 87380)
min: 41.7 us
stddev: 32.2 us
When I profile the same application with nvprof, I get the following result:
==9021== Profiling application: ./MatVecCublasPinned
==9021== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.19% 2.22943s 100001 22.294us 19.970us 30.819us void gemv2N_kernel<int, int, double, double, double, int=128, int=32, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor<double const >, cublasGemvTensor<double>, double>>(double const )
14.83% 397.43ms 200002 1.9870us 1.6000us 5.9210us [CUDA memcpy DtoD]
1.98% 52.937ms 100001 529ns 416ns 2.8810us [CUDA memcpy DtoH]
0.00% 29.411us 4 7.3520us 512ns 27.331us [CUDA memcpy HtoD]
API calls: 68.45% 5.62046s 300007 18.734us 13.473us 46.028ms cudaMemcpy
15.24% 1.25117s 100001 12.511us 11.073us 8.8447ms cudaLaunchKernel
13.76% 1.13014s 7 161.45ms 2.9440us 730.11ms cudaFree
2.03% 166.58ms 4 41.645ms 12.704us 166.51ms cudaMallocHost
0.49% 40.543ms 100001 405ns 320ns 41.187us cudaGetLastError
0.01% 585.95us 8 73.243us 9.6000us 311.86us cudaMalloc
0.00% 389.56us 4 97.389us 13.985us 331.35us cudaFreeHost
0.00% 298.80us 285 1.0480us 352ns 46.210us cuDeviceGetAttribute
0.00% 200.36us 90 2.2260us 1.2800us 24.993us cudaFuncSetAttribute
0.00% 40.483us 3 13.494us 8.8010us 20.930us cuDeviceTotalMem
0.00% 33.154us 18 1.8410us 1.2800us 6.9450us cudaEventCreateWithFlags
0.00% 19.969us 1 19.969us 19.969us 19.969us cudaGetDevice
0.00% 11.904us 5 2.3800us 800ns 5.8240us cuDeviceGetCount
0.00% 10.785us 11 980ns 704ns 2.8160us cudaDeviceGetAttribute
0.00% 9.8570us 2 4.9280us 4.8000us 5.0570us cuInit
0.00% 8.7690us 3 2.9230us 2.6880us 3.0410us cuDeviceGetUuid
0.00% 4.6400us 2 2.3200us 1.9840us 2.6560us cuDriverGetVersion
0.00% 4.5760us 3 1.5250us 1.4080us 1.6000us cuDeviceGetName
0.00% 4.4160us 4 1.1040us 800ns 1.8240us cuDeviceGet
I am confused about why there is such a large difference between the two results: clock_gettime() reports an average of 46.3 us per iteration, while nvprof reports an average of only 22.3 us for the gemv kernel. Which method shows the accurate result? Could someone help me with this issue?
Thanks in advance!