I am trying to profile a kernel, but the duration reported by nvprof does not match the one computed with cudaEventElapsedTime.
This is the code I use to time the kernel:
cudaProfilerStart();
HANDLE_ERROR(cudaEventRecord(Start, 0));
MCSimuScatteringEffectLoop<<<Blocks, Threads>>>(pDevGPUCall);
HANDLE_ERROR(cudaPeekAtLastError());
HANDLE_ERROR(cudaDeviceSynchronize());
HANDLE_ERROR(cudaEventRecord(Stop, 0));
HANDLE_ERROR(cudaEventSynchronize(Stop));
HANDLE_ERROR(cudaEventElapsedTime(&ScatteringElapsedTime, Start, Stop));
cudaProfilerStop();
std::cout << "ScatteringElapsedTime " << ScatteringElapsedTime << std::endl;
When I run this code on its own, I get the following output:
Now, if I run the same code under nvprof:
nvprof --profile-from-start off --print-gpu-trace :
==22577== Profiling result:
Start     Duration  Grid Size  Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  Device           Context  Stream  Name
2.20815s  283.33ms  (32 1 1)   (256 1 1)   63     0B      0B      -     -           Quadro K4200 (0  1        7       MCSimuScatteringEffectLoop(GPUCallType*)
GPU : Nvidia Quadro K4200
NVCC : 7.5
OS : CentOS 6.6
I don't understand why there is a difference of about 160 ms between the duration reported by nvprof and the result computed by cudaEventElapsedTime().
I expect the kernel to take around 280 ms, so the profiler's result seems to be the correct one.
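To help narrow this down, here is a minimal, self-contained sketch of the same event-based timing pattern. It is not the real code: dummyKernel, CHECK and timeKernelMs are hypothetical stand-ins for MCSimuScatteringEffectLoop, HANDLE_ERROR and the timing section above. It differs from the snippet above in two ways that may or may not be relevant: a warm-up launch runs before the timed one, and the stop event is recorded directly after the launch, before any host-side synchronization.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro, standing in for HANDLE_ERROR.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel (NOT MCSimuScatteringEffectLoop): some busy work per element.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) {
            x = x * 1.0001f + 0.5f;
        }
        data[i] = x;
    }
}

// Returns the time in ms between two events that bracket only the kernel launch.
float timeKernelMs(int blocks, int threads)
{
    int n = blocks * threads;
    float *d = NULL;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch, so one-time initialization is outside the timed interval.
    dummyKernel<<<blocks, threads>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start, 0));
    dummyKernel<<<blocks, threads>>>(d, n);
    CHECK(cudaEventRecord(stop, 0));    // recorded before any host-side sync
    CHECK(cudaGetLastError());
    CHECK(cudaEventSynchronize(stop));  // wait until the stop event has completed

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    return ms;
}

int main()
{
    // Same launch configuration as in the trace above: 32 blocks of 256 threads.
    float ms = timeKernelMs(32, 256);
    std::printf("elapsed: %f ms\n", ms);
    return 0;
}
```

Comparing the value printed by this sketch with the Duration column of nvprof --print-gpu-trace for dummyKernel would show whether the two measurements also disagree for a trivial kernel.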
Do you know what causes this discrepancy?
Thank you very much.