Hey CUDA community,
Maybe NVIDIA folks can comment on this, or could someone please point me to an NVIDIA best-practices doc?
I have read the CUDA API docs and searched these forums, but I cannot find a best-practice recommendation for the scenario where low CPU load is preferred and a slight increase in CUDA kernel latency is tolerable. See, e.g., my search results.
I’m talking about the following usage pattern, which I believe is the traditional one; I have intentionally left out error checking for clarity:
cudaSetDevice(…)
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)
cudaMalloc(…)
loop many times:
    cudaMemcpy(…, cudaMemcpyHostToDevice)
    kernel<<<…>>>(…)
    cudaDeviceSynchronize()      ← useful for error checking ASAP
    cudaMemcpy(…, cudaMemcpyDeviceToHost)
cudaFree(…)
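For concreteness, here is a minimal compilable sketch of that pattern (the kernel, buffer size, and iteration count are placeholders for illustration, not my actual code; error checking is again omitted):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // stand-in for the compute-heavy kernel
}

int main()
{
    const int n = 1 << 20;                 // placeholder problem size
    const size_t bytes = n * sizeof(float);

    cudaSetDevice(0);
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);   // ask for blocking sync instead of spin-wait

    float *hBuf = (float *)calloc(n, sizeof(float));
    float *dBuf = nullptr;
    cudaMalloc(&dBuf, bytes);

    for (int iter = 0; iter < 1000; ++iter) {
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
        dummyKernel<<<(n + 255) / 256, 256>>>(dBuf, n);
        cudaDeviceSynchronize();           // error checking ASAP would go here
        cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dBuf);
    free(hBuf);
    return 0;
}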
What CPU load overhead from cudaDeviceSynchronize() should we expect for a process that iteratively runs a compute-heavy kernel following the above pattern?
Besides high CPU load (about 50% of the CPU core running my CUDA host code), I believe I am also seeing continuous and substantial device-to-host PCIe traffic, which I can only attribute to synchronization overhead. From my observations, my own code's PCIe usage is asymmetric, pushing data predominantly to the device, while nvidia-smi reports asymmetric PCIe load predominantly from the device. I'm not sure which tool I can use to confirm my suspicion about this PCIe utilization overhead.
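Would watching the per-direction PCIe throughput counters from nvidia-smi dmon be a reasonable way to confirm this, e.g.:

nvidia-smi dmon -s t     # per-second PCIe Rx/Tx throughput per GPU, in MB/s

or is there a better counter or tool to watch for this purpose?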
Is there a demo program, written according to NVIDIA best practices and known to exhibit low CPU utilization and low PCIe usage, for the case where CPU-side data preparation for the kernel is insignificant compared to the kernel run time and the data transferred over PCIe is small?
Thank you in advance for your insights!