Discrepancy between cudaEventElapsedTime and nvprof

Hello

I am trying to profile a kernel, but I see a discrepancy between the time reported by nvprof and the one returned by cudaEventElapsedTime().

Here is the timing code around the kernel launch:

cudaProfilerStart() ;
  HANDLE_ERROR(cudaEventRecord(Start,0) );

  MCSimuScatteringEffectLoop<<<Blocks,Threads>>>(pDevGPUCall);
  
  HANDLE_ERROR(cudaPeekAtLastError()) ;
  HANDLE_ERROR(cudaDeviceSynchronize()) ;
  HANDLE_ERROR(cudaEventRecord(Stop,0) );
  HANDLE_ERROR(cudaEventSynchronize(Stop)) ;
  HANDLE_ERROR(cudaEventElapsedTime(&ScatteringElapsedTime,Start,Stop)) ;
  cudaProfilerStop() ;
  std::cout << "ScatteringElapsedTime       " << ScatteringElapsedTime << std::endl ;

When I run this code, I get the following output:

ScatteringElapsedTime       444.185638428

Now, if I run the same code under nvprof with

nvprof --profile-from-start off --print-gpu-trace

I get:

==22577== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
2.20815s  283.33ms             (32 1 1)       (256 1 1)        63        0B        0B         -           -  Quadro K4200 (0         1         7  MCSimuScatteringEffectLoop(GPUCallType*) [180]

Configuration:
GPU: NVIDIA Quadro K4200
NVCC: 7.5
OS: CentOS 6.6

I don't know where the roughly 160 ms difference between nvprof and the result computed by cudaEventElapsedTime() comes from.
On my side I expect something around 280 ms, so the profiler result looks correct to me.
Do you know why I have this discrepancy?

Thank you very much.

Regards
Ttart398

Take the cudaDeviceSynchronize() out of the timing region.

Alternatively, leave it there and place another cudaDeviceSynchronize() before the start of your timing (i.e., before the event record on the start event).

The device synchronize inside your timing region picks up the cost of any previously issued asynchronous activity. If such activity is still pending, it causes a discrepancy between the profiler's measurement of that kernel and the time recorded by the event timing.
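
To make this concrete, here is a minimal sketch of the intended pattern, reusing the names from the code above (MCSimuScatteringEffectLoop, pDevGPUCall, Blocks, Threads, HANDLE_ERROR); only the kernel launch sits between the two events:

  HANDLE_ERROR(cudaDeviceSynchronize());          // drain any previously queued asynchronous work
  HANDLE_ERROR(cudaEventRecord(Start, 0));

  MCSimuScatteringEffectLoop<<<Blocks,Threads>>>(pDevGPUCall);

  HANDLE_ERROR(cudaEventRecord(Stop, 0));
  HANDLE_ERROR(cudaEventSynchronize(Stop));       // wait until the Stop event has completed
  HANDLE_ERROR(cudaEventElapsedTime(&ScatteringElapsedTime, Start, Stop));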

I tried the two solutions described above, but the problem persists:

Solution 1:

cudaProfilerStart() ;
  HANDLE_ERROR(cudaEventRecord(Start,0) );

  MCSimuScatteringEffectLoop<<<Blocks,Threads>>>(pDevGPUCall);
  

  HANDLE_ERROR(cudaEventRecord(Stop,0) );
  HANDLE_ERROR(cudaEventSynchronize(Stop)) ;
  HANDLE_ERROR(cudaEventElapsedTime(&ScatteringElapsedTime,Start,Stop)) ;
  std::cout << "ScatteringElapsedTime       " << ScatteringElapsedTime << std::endl ;
  HANDLE_ERROR(cudaPeekAtLastError()) ;
  HANDLE_ERROR(cudaDeviceSynchronize()) ;
  cudaProfilerStop() ;

Output:

ScatteringElapsedTime       435.63571167

Solution 2:

cudaProfilerStart() ;
  HANDLE_ERROR(cudaDeviceSynchronize()) ;
  HANDLE_ERROR(cudaEventRecord(Start,0) );

  MCSimuScatteringEffectLoop<<<Blocks,Threads>>>(pDevGPUCall);
  

  HANDLE_ERROR(cudaEventRecord(Stop,0) );
  HANDLE_ERROR(cudaEventSynchronize(Stop)) ;
  HANDLE_ERROR(cudaEventElapsedTime(&ScatteringElapsedTime,Start,Stop)) ;
  std::cout << "ScatteringElapsedTime       " << ScatteringElapsedTime << std::endl ;
  HANDLE_ERROR(cudaPeekAtLastError()) ;
  HANDLE_ERROR(cudaDeviceSynchronize()) ;
  HANDLE_ERROR(cudaEventDestroy(Start)) ;
  HANDLE_ERROR(cudaEventDestroy(Stop)) ;
  cudaProfilerStop() ;

Output:

ScatteringElapsedTime       444.27923584

I have done a bit of testing: when I launch my kernel with no data to process, the event timing reports about 150 ms, whereas I expect something like 10 ms.
I then tried to profile the kernel with nvvp, at first without success.

Looking more closely at nvvp, I see a strange result when calling my kernel with no data to process:

the duration of MCSimuScatteringEffectLoop is about 187 us
the duration of the cudaLaunch call associated with MCSimuScatteringEffectLoop is about 15 ms
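
As a cross-check, the launch call itself can also be timed on the host, independently of the CUDA events; a minimal sketch using std::chrono (same kernel and names as above, the tLaunch/tDone variables are just illustrative):

  #include <chrono>   // std::chrono::steady_clock

  auto tLaunch0 = std::chrono::steady_clock::now();
  MCSimuScatteringEffectLoop<<<Blocks,Threads>>>(pDevGPUCall);   // the launch call itself is asynchronous
  auto tLaunch1 = std::chrono::steady_clock::now();              // reached before the kernel finishes
  HANDLE_ERROR(cudaDeviceSynchronize());                         // now wait for the kernel to complete
  auto tDone = std::chrono::steady_clock::now();

  std::cout << "launch call  " << std::chrono::duration<double, std::milli>(tLaunch1 - tLaunch0).count() << " ms" << std::endl;
  std::cout << "launch + run " << std::chrono::duration<double, std::milli>(tDone - tLaunch0).count() << " ms" << std::endl;

If the launch call alone already takes milliseconds, the overhead is on the host/driver side rather than in the kernel's GPU execution that nvvp reports.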

Does anyone know about this kind of issue?
Regards

This is all very mysterious. The launch overhead for a null kernel should be on the order of 5 microseconds. My first thought was that this is WDDM batching on Windows, but now I see you are on Linux. Are you running Linux natively on bare metal, or in a virtualization (hypervisor) environment, by any chance?

I am running my program on native Linux, but I am not running a null kernel; I am running the kernel with one data item to process (so only one thread is active).

I finally found the issue:
I was allocating a buffer for printing error information with this line:

cudaDeviceSetLimit(cudaLimitPrintfFifoSize,4096*4096*50);

Once I commented out that line, the 160 ms offset was gone.
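
For context, 4096*4096*50 bytes is roughly 800 MB of device-side printf buffer, and the printf FIFO is flushed at synchronization points such as cudaDeviceSynchronize(), so presumably handling that oversized buffer is what landed inside the event-timed region but outside the kernel duration reported by nvprof. If the limit is needed at all, something far smaller should do; a minimal sketch (the 1 MB value is only an illustrative choice):

  // Keep the device-side printf FIFO at a modest size (1 MB here, illustrative).
  // The previous value, 4096*4096*50 bytes, is roughly 800 MB.
  HANDLE_ERROR(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20));

  size_t printfFifoSize = 0;
  HANDLE_ERROR(cudaDeviceGetLimit(&printfFifoSize, cudaLimitPrintfFifoSize));
  std::cout << "printf FIFO size: " << printfFifoSize << " bytes" << std::endl;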

Regards