Best practices for cudaDeviceScheduleBlockingSync usage pattern on Linux

Hey CUDA community,

Perhaps NVIDIA folks can comment on this, or someone could point me to an NVIDIA best-practices doc?

I have read the CUDA API docs and searched these forums, but I cannot find a best-practice recommendation for the scenario where low CPU load is preferred and slight additional CUDA kernel latency is tolerable. See, e.g., my search results.

I’m talking about the following usage pattern, which I believe is the traditional one, and from which I intentionally excluded error checking for clarity:

cudaSetDevice(…)
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)

cudaMalloc(…)

loop many times:
cudaMemcpy(… cudaMemcpyHostToDevice)
kernel<<<…>>>(…)
cudaDeviceSynchronize(…) ← useful for error checking ASAP
cudaMemcpy(… cudaMemcpyDeviceToHost)

cudaFree(…)
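
For concreteness, here is a minimal compilable sketch of that pattern (the kernel, buffer size, and iteration count are placeholders picked for illustration; error checking is still omitted):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n){               // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(){
    const int n = 1 << 20;                                   // placeholder problem size
    float *h_data = new float[n]();
    float *d_data;

    cudaSetDevice(0);
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaMalloc(&d_data, n * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter){                 // "loop many times"
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();                             // useful for error checking ASAP
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);
    delete[] h_data;
}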

What kind of CPU load overhead, caused by cudaDeviceSynchronize(…), should we expect for a process that iteratively runs a compute-heavy kernel following the above pattern?

Besides high CPU load (about 50% of a CPU core for the thread running my CUDA host code), I believe I’m observing continuous and substantial device-to-host PCI transfer, which I can only attribute to synchronization overhead. From my observations, my own code’s PCI usage is asymmetric, pushing data predominantly to the device, while nvidia-smi reports an asymmetric PCI load coming predominantly from the device. I’m not sure which tool I can use to confirm my suspicion about this PCI utilization overhead.

Is there a demo program, written according to NVIDIA best practices and known to exhibit low CPU utilization and low PCI usage, for the case where CPU-side data preparation for the kernel is insignificant compared to the kernel run time and data transfers over PCI are small?

Thank you in advance for your insights!

I’m not aware of demo programs or best practices documents for this.

There is the documentation of course.

When I compile the following test program with -DUSE_BLOCK, run it, and observe its behavior using top, top reports approximately 0% CPU utilization (%CPU).

const unsigned long long my_duration = 20000000000ULL;

__global__ void kwait(unsigned long long duration){
        unsigned long long start = clock64();
        while (clock64() < start+duration);
}

int main(){

#ifdef USE_BLOCK
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
#endif
#ifdef USE_YIELD
  cudaSetDeviceFlags(cudaDeviceScheduleYield);
#endif
#ifdef USE_SPIN
  cudaSetDeviceFlags(cudaDeviceScheduleSpin);
#endif
  kwait<<<1,1>>>(my_duration);
  cudaDeviceSynchronize();
}

(CUDA 11.3, GTX 960, driver 465.19.01)

If I compile with either -DUSE_YIELD or -DUSE_SPIN I observe approximately 100% CPU utilization.
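
(For reference, assuming the test is saved as wait.cu, the three variants can be built and each run under top like this:

nvcc -DUSE_BLOCK -o wait_block wait.cu
nvcc -DUSE_YIELD -o wait_yield wait.cu
nvcc -DUSE_SPIN  -o wait_spin  wait.cu)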

My rationale is as follows:

  • yield allows the CPU thread (waiting at cudaDeviceSynchronize()) to yield to other threads’ work, if any, but does not put any threads “to sleep”. If there are no other threads waiting to run, the CPU thread will spin at the sync point.
  • spin does what you expect: there is no thread negotiation, and the waiting CPU thread just spins at the sync point.
  • blocking sync somehow puts the CPU thread to sleep, waking it occasionally to check for completion of the device activity (a per-event variant of this blocking wait is sketched below).
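
As an aside: the same blocking-wait behavior can also be requested per event rather than device-wide, by creating an event with the cudaEventBlockingSync flag and waiting on it with cudaEventSynchronize(). A minimal sketch, with a do-nothing placeholder kernel:

#include <cuda_runtime.h>

__global__ void dummy(){ }                                   // placeholder kernel

int main(){
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);  // blocking-wait semantics for this event only

    dummy<<<1,1>>>();
    cudaEventRecord(done, 0);                                // record on the default stream, after the kernel
    cudaEventSynchronize(done);                              // the waiting CPU thread sleeps until the event completes
    cudaEventDestroy(done);
}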

I don’t have anything to share beyond that. I’m generally not able to disclose things that are not documented (excepting those things which are discoverable experimentally of course). If you feel like CUDA documentation is lacking you are encouraged to file a bug.

Thank you, Robert, the demo you posted is a great start! I reran it on my system and can confirm that I also see the low CPU utilization exhibited by your code, which is reassuring. I made your kernel a bit more computationally complex (so that the compiler cannot optimize it out), invoked it on a large set of blocks, and I still see very low CPU utilization.

Next question: what tool should I use to attribute the high CPU and PCI utilization that I’m seeing in my own program to a particular function or line of code, whether it lies within the CUDA toolchain or within my code? I tried nvvp, and while I do see kernel-related performance numbers there, I am not sure where to look for CPU-related information.

Thank you again for a prompt and detailed reply!

I don’t think the compiler can optimize out the code I wrote. It’s easy to demonstrate that this is not happening.

What the kernel is actually doing has no bearing on the CPU thread’s waiting activity. The only thing that matters is the kernel duration.

I’m not sure. I think the best suggestion is one of the newer profilers, though I don’t know how you observed the CPU or PCI activity in the first place. A profiler is certainly what I would use to attribute CPU activity; asking a profiler for CPU hotspots is a standard methodology. I’m not sure I can give you a “recipe” here. The newer profilers (Nsight Systems) can give you various CPU measurements, but attributing CPU load is not something that can only be done with a GPU profiler; learning to use any CPU profiler would probably work here as well. For example, gprof can identify CPU hotspots.
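
For example, something along these lines (illustrative; my_app is a placeholder, and the exact options may vary by version):

nsys profile --stats=true -o my_report ./my_app    ← CPU and GPU timeline plus summary statistics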

Hey Robert,

I’ve used various CPU profilers before, including on the program that I’m developing now. I’ll give one of those another try.

Speaking of the PCI activity, I’m running

nvidia-smi dmon -d 10 -s ucmt -o T

and I’m observing the last two columns in the output table, rxpci and txpci. The documentation for this utility mentions “receive” and “transmission” throughput. I’m wondering whether these definitions are device- or host-centric: does txpci measure the rate at which the device transmits, or the rate at which the host transmits? It would be great to see these receive/transmit activities attributed to the particular calls in my program, similar to how profilers indicate hotspots.

Thanks again for your explanations.

The profilers can attribute PCIE activity to explicit calls in your program (e.g. cudaMemcpyAsync). However, they are not going to do a great job of exposing all the traffic that may occur due to the management of the GPU by the CUDA runtime (i.e. the “operating system” for the GPU); I don’t know of a tool that does that. Synchronization, and whatever it takes to accomplish it, is a function of CUDA runtime behavior. If this is very important, you could probably construct microbenchmarking studies, perhaps combined with your usage of nvidia-smi (which knows essentially nothing about your CUDA code). Similar microbenchmarking should be able to answer your question about which direction the traffic columns refer to (I don’t happen to know).
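
For what it’s worth, here is a hedged sketch of the kind of microbenchmark I mean: a process that repeatedly launches a kernel and synchronizes but performs no explicit host-to-device or device-to-host copies, so any sustained rxpci/txpci that your nvidia-smi dmon command reports while it runs can only come from launch/synchronization traffic and runtime management of the GPU, not from the application’s own data movement.

#include <cuda_runtime.h>

__global__ void empty_kernel(){ }                 // does nothing: only launch/sync overhead remains

int main(){
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    // No cudaMemcpy anywhere: watch the rxpci/txpci columns of your
    // nvidia-smi dmon invocation in another terminal while this loop runs.
    for (int i = 0; i < 1000000; ++i){
        empty_kernel<<<1,1>>>();
        cudaDeviceSynchronize();
    }
    return 0;
}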