This is a rather high-level question. I have been trying out the “PC sampling” feature of CUPTI. In particular, I’ve essentially taken the pc_sampling CUPTI CUDA sample program and added it to my program:
//
// Register callbacks for when CUPTI_ACTIVITY_KIND_PC_SAMPLING records arrive.
//
CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));

//
// Enable PC sampling activities.
//
CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_PC_SAMPLING));

//
// Use the minimal sampling interval (to reduce sampling runtime overheads).
//
CUpti_ActivityPCSamplingConfig configPC;
CUcontext cuCtx;
configPC.size = sizeof(CUpti_ActivityPCSamplingConfig);
configPC.samplingPeriod = CUPTI_ACTIVITY_PC_SAMPLING_PERIOD_MIN;
configPC.samplingPeriod2 = 0;
cuCtxGetCurrent(&cuCtx);
CUPTI_CALL(cuptiActivityConfigurePCSampling(cuCtx, &configPC));
My question is: what run-time overhead should I expect with the PC sampling feature enabled as above (e.g. 50%? 100%?)? I ask because I am encountering really big overheads (1000x or more) just from adding the above code, even with empty bufferCompleted callbacks that simply discard the activity records.
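To be concrete about what I mean by “empty” callbacks, here is a minimal sketch of the two functions I register (the signatures are the standard CUPTI buffer-request/buffer-complete callback types; `BUF_SIZE` is just an assumed constant for illustration):

```c
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

// Hand CUPTI a freshly allocated buffer whenever it asks for one.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *size = BUF_SIZE;
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *maxNumRecords = 0;  // 0 = let CUPTI pack as many records as fit
}

// Free completed buffers without parsing any activity records.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    free(buffer);
}
```

So no record processing happens at all on my side; the slowdown appears to come purely from having PC sampling enabled.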
I also tried enabling PC sampling by running my program under nvprof instead, and I saw the exact same slowdown:
$ nvprof --source-level-analysis pc_sampling -o profile.out ...
- CUDA docker image: 10.1-cudnn7-devel-ubuntu18.04
- GPU: Quadro P4000
- GPU driver version: 418.67
Coming from the world of VTune, I find this overhead really surprising. I thought the whole point of a sampling profiler was to keep run-time overhead minimal. Am I missing something, or is this expected behaviour?
Thanks in advance,