PC Sampling leads to large slow-downs in execution time?

This is a rather high-level question. I have been trying out the “PC sampling” feature of CUPTI. In particular, I’ve essentially taken the pc_sampling CUPTI sample program that ships with CUDA, and added it to my program:

//
// Register callbacks for when CUPTI_ACTIVITY_KIND_PC_SAMPLING records arrive.
// 
CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));


//
// Enable PC sampling activities.
//
CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_PC_SAMPLING));


//
// Use the minimum sampling period (i.e. the highest sampling frequency).
//
CUpti_ActivityPCSamplingConfig configPC;
CUcontext cuCtx;
configPC.size = sizeof(CUpti_ActivityPCSamplingConfig);
configPC.samplingPeriod = CUPTI_ACTIVITY_PC_SAMPLING_PERIOD_MIN;
configPC.samplingPeriod2 = 0;
cuCtxGetCurrent(&cuCtx);
CUPTI_CALL(cuptiActivityConfigurePCSampling(cuCtx, &configPC));

My question is: how much run-time overhead should I expect with the PC sampling feature enabled as above (e.g. 50%? 100%?)? I ask because I am encountering really big overheads (1000x or more) just from adding the above code, even with empty bufferCompleted callbacks that simply ignore the activity records.
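For concreteness, no-op buffer callbacks in the shape CUPTI expects look roughly like this. This is only a sketch: BUF_SIZE is an arbitrary choice of mine, and the stand-in typedef/macro at the top exists only so the snippet is self-contained without cuda.h/cupti.h (in the real program those come from the headers):

```cpp
#include <cstdint>
#include <cstdlib>

// Stand-ins so this sketch compiles without cuda.h/cupti.h; in the real
// program these definitions come from the CUDA/CUPTI headers.
#ifndef CUPTIAPI
#define CUPTIAPI
#endif
typedef void *CUcontext;

static const size_t BUF_SIZE = 8 * 1024 * 1024;  // arbitrary buffer size

// CUPTI requests an activity buffer through this callback.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
  *buffer = (uint8_t *)malloc(BUF_SIZE);
  *size = BUF_SIZE;
  *maxNumRecords = 0;  // 0 = let CUPTI fill the buffer with as many records as fit
}

// CUPTI hands back a completed buffer; we discard it without parsing
// any of the activity records it contains.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
  (void)ctx; (void)streamId; (void)size; (void)validSize;
  free(buffer);
}
```

Even with the records ignored like this, the sampling itself still runs on the GPU side, which is why I expected only modest overhead.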

I also tried enabling PC sampling by running my program with nvprof, but I had the exact same slow-down behaviour:

$ nvprof --source-level-analysis pc_sampling -o profile.out ...

System configuration:

  • CUDA docker image: 10.1-cudnn7-devel-ubuntu18.04
  • GPU: Quadro P4000
  • GPU driver version: 418.67

Coming from the world of VTune, I find this overhead really surprising. I thought the whole point of a sampling profiler was to have minimal runtime overhead… am I missing something, or is this expected behaviour?

Thanks in advance,
James

We have not seen such a large (1000x) overhead. With the driver version you are using, for some applications we have seen a 20x to 50x overhead (for the lowest and highest sampling rates, respectively). We have made some improvements that reduce the overhead to 2x to 5x (for the same application), but these are available only for Volta and later GPU architectures. Note that these specific overhead measurements were made using CUPTI, not nvprof; we will try to check the overhead numbers with nvprof.

Note that one reason for the slowdown with PC sampling is kernel serialization (if your application uses concurrent kernels).

Also, the overhead will depend on:

  • the number of kernel launches; and
  • the PC sampling period

How many GPU kernel launches does the application make?
You can check the overhead by collecting PC sampling data for a single kernel launch (using the nvprof “--kernels” option).

You can also try increasing the PC sampling period using the nvprof “--pc-sampling-period” option.
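For example, a combined invocation might look like the following (a hypothetical command line: “myKernel” and “./my_app” are placeholders, and per the nvprof documentation the period argument is an exponent, i.e. samples are taken every 2^N cycles, so a larger value means less frequent sampling and lower overhead):

```shell
# Sample only one kernel, with a longer sampling period to reduce overhead.
nvprof --kernels myKernel \
       --source-level-analysis pc_sampling \
       --pc-sampling-period 12 \
       -o profile.out ./my_app
```

Comparing the run time of this against a run without the PC sampling options should give you a per-kernel overhead number you can report back.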