This is a rather high-level question. I have been trying out the “PC sampling” feature of CUPTI. In particular, I’ve essentially taken the pc_sampling CUPTI CUDA sample program and added its setup code to my program:
//
// Register callbacks for when CUPTI_ACTIVITY_KIND_PC_SAMPLING records arrive.
//
CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));
//
// Enable PC sampling activities.
//
CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_PC_SAMPLING));
//
// Use the minimal sampling interval (to reduce sampling runtime overheads).
//
CUpti_ActivityPCSamplingConfig configPC;
CUcontext cuCtx;
configPC.size = sizeof(CUpti_ActivityPCSamplingConfig);
configPC.samplingPeriod = CUPTI_ACTIVITY_PC_SAMPLING_PERIOD_MIN;
configPC.samplingPeriod2 = 0;
cuCtxGetCurrent(&cuCtx);
CUPTI_CALL(cuptiActivityConfigurePCSampling(cuCtx, &configPC));
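In case it matters, CUPTI_CALL here is just the usual error-checking wrapper from the CUPTI samples, roughly like this (my exact macro may differ slightly):

#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

// Abort with a readable message if a CUPTI call does not return CUPTI_SUCCESS.
#define CUPTI_CALL(call)                                                      \
  do {                                                                        \
    CUptiResult _status = call;                                               \
    if (_status != CUPTI_SUCCESS) {                                           \
      const char *errstr;                                                     \
      cuptiGetResultString(_status, &errstr);                                 \
      fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n",    \
              __FILE__, __LINE__, #call, errstr);                             \
      exit(EXIT_FAILURE);                                                     \
    }                                                                         \
  } while (0)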
My question is: roughly what run-time overhead should I expect with the PC sampling feature enabled as above (e.g. 50%? 100%?)? I ask because I am seeing really large overheads (a slowdown of 1000x or more) just from adding the code above, even with empty bufferCompleted callbacks that simply discard the activity records.
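For concreteness, the buffer callbacks are essentially the stubs from the CUPTI activity samples with all record processing stripped out; a minimal sketch of what I mean (the buffer size is arbitrary):

#include <stdint.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

// Hand CUPTI a fresh buffer whenever it asks for one.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxNumRecords = 0;  // 0 = let CUPTI pack in as many records as fit
}

// Deliberately ignore every activity record; just release the buffer.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    free(buffer);
}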
I also tried enabling PC sampling by running my program under nvprof, and I saw exactly the same slowdown behaviour:
$ nvprof --source-level-analysis pc_sampling -o profile.out ...
System configuration:
- CUDA Docker image: 10.1-cudnn7-devel-ubuntu18.04
- GPU: Quadro P4000
- GPU driver version: 418.67
Coming from the world of VTune, I find this overhead really surprising. I thought the whole point of a sampling profiler was to have minimal runtime overhead… am I missing something, or is this expected behaviour?
Thanks in advance,
James