I am working on matrix-vector multiplication algorithm development on a Jetson Xavier AGX. I want to collect timing information using clock_gettime, as shown in the following code:
Timings timings(1000, 0);
for (long long samples = -1; samples < 100000; samples++) {
    struct timespec before;
    clock_gettime(CLOCK_REALTIME, &before);

    cudaMemcpy(d_y, d_BD, sizeBD, cudaMemcpyDeviceToDevice);
    beta = beta * 1.01; // Just vary the input a little

    // Calculate: y = (alpha*A) * x + (beta*y)
    cublasDgemv(handle, CUBLAS_OP_N, M, K, &alpha, d_AC, M, d_x, incx, &beta, d_y, incy);

    // Store next state vector
    cudaMemcpy(d_x, d_y, sizeX, cudaMemcpyDeviceToDevice);

    struct timespec after;
    clock_gettime(CLOCK_REALTIME, &after);

    // Copy back to the host (outside the timed region)
    cudaMemcpy(h_y, d_y, sizeBD, cudaMemcpyDeviceToHost);

    if (samples >= 0) { // samples == -1 is a warm-up iteration, not recorded
        timings.append(before, after);
    }
}
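For reference, Timings is a small helper class I use to accumulate the samples. A simplified sketch of what it does (the real class has more bookkeeping, e.g. it also tracks min, stddev, and which iteration produced the max; the constructor arguments are sizing/configuration details omitted here) looks roughly like this:

#include <time.h>
#include <vector>

class Timings {
public:
    // Sketch only: first argument treated as a capacity hint, second unused.
    Timings(size_t reserve, int /*unused in this sketch*/) {
        samples_.reserve(reserve);
    }

    // Store the elapsed wall-clock time between two clock_gettime()
    // readings, converted to microseconds.
    void append(const struct timespec &before, const struct timespec &after) {
        double us = (after.tv_sec  - before.tv_sec)  * 1e6 +
                    (after.tv_nsec - before.tv_nsec) * 1e-3;
        samples_.push_back(us);
    }

    // Mean of all recorded samples, in microseconds.
    double average() const {
        double sum = 0.0;
        for (double s : samples_) sum += s;
        return samples_.empty() ? 0.0 : sum / samples_.size();
    }

private:
    std::vector<double> samples_;
};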
Using clock_gettime to time each iteration, I get:
Tests: 100000
average: 46.3 us
max: 8929.3 us ← (iteration 87380)
min: 41.7 us
stddev: 32.2 us
When I profile the same application with nvprof, I get the following result:
==9021== Profiling application: ./MatVecCublasPinned
==9021== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.19% 2.22943s 100001 22.294us 19.970us 30.819us void gemv2N_kernel<int, int, double, double, double, int=128, int=32, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor<double const >, cublasGemvTensor<double>, double>>(double const )
14.83% 397.43ms 200002 1.9870us 1.6000us 5.9210us [CUDA memcpy DtoD]
1.98% 52.937ms 100001 529ns 416ns 2.8810us [CUDA memcpy DtoH]
0.00% 29.411us 4 7.3520us 512ns 27.331us [CUDA memcpy HtoD]
API calls: 68.45% 5.62046s 300007 18.734us 13.473us 46.028ms cudaMemcpy
15.24% 1.25117s 100001 12.511us 11.073us 8.8447ms cudaLaunchKernel
13.76% 1.13014s 7 161.45ms 2.9440us 730.11ms cudaFree
2.03% 166.58ms 4 41.645ms 12.704us 166.51ms cudaMallocHost
0.49% 40.543ms 100001 405ns 320ns 41.187us cudaGetLastError
0.01% 585.95us 8 73.243us 9.6000us 311.86us cudaMalloc
0.00% 389.56us 4 97.389us 13.985us 331.35us cudaFreeHost
0.00% 298.80us 285 1.0480us 352ns 46.210us cuDeviceGetAttribute
0.00% 200.36us 90 2.2260us 1.2800us 24.993us cudaFuncSetAttribute
0.00% 40.483us 3 13.494us 8.8010us 20.930us cuDeviceTotalMem
0.00% 33.154us 18 1.8410us 1.2800us 6.9450us cudaEventCreateWithFlags
0.00% 19.969us 1 19.969us 19.969us 19.969us cudaGetDevice
0.00% 11.904us 5 2.3800us 800ns 5.8240us cuDeviceGetCount
0.00% 10.785us 11 980ns 704ns 2.8160us cudaDeviceGetAttribute
0.00% 9.8570us 2 4.9280us 4.8000us 5.0570us cuInit
0.00% 8.7690us 3 2.9230us 2.6880us 3.0410us cuDeviceGetUuid
0.00% 4.6400us 2 2.3200us 1.9840us 2.6560us cuDriverGetVersion
0.00% 4.5760us 3 1.5250us 1.4080us 1.6000us cuDeviceGetName
0.00% 4.4160us 4 1.1040us 800ns 1.8240us cuDeviceGet
I am confused about why there is such a large difference between the two results: clock_gettime() reports an average of 46.3 us per iteration, while nvprof reports an average of only 22.3 us for the gemv kernel. Which method shows the accurate result? Could someone help me with this issue?
Thanks in advance!