Expected CPU utilization during cudaMemcpyHostToDevice


I’m getting ready to implement my application in CUDA, and started running some benchmarks on data copies back and forth. Here is my question.

What is the expected host CPU impact of cudaMemcpy calls?

I thought that the DMA controller on the GTX card would take care of the actual cudaMemcpy, and that the host CPU would be involved only minimally. Yet my benchmark shows one of the host CPU cores maxed out.

Here is my benchmark:

C:\ws\com.slytechs.netcapture-direct\src\cuda>cudabench 1000 64

device2device: size=64000MB time(1189)=1.189ms 420.521Gbps

host2Device: size=64000MB time(13007)=13.007ms 38.441Gbps

host2host: size=64000MB time(12551)=12.551ms 39.837Gbps

The main loops look something like this; they differ only in the cudaMemcpy copy kind, or use memcpy instead for the host-to-host case:

for (size_t i = 0; i < LOOP; i++) {
		cudaMemcpy(dev_buffer, hst_buffer, BUF, cudaMemcpyHostToDevice);
}

My benchmark above, with a GTX560 card, uses 100% of one of my cores (Intel i7-970 3GHz 6xCORE/12xTHREAD, triple-channel DDR3-1600). The benchmark allocates “pinned” memory on the host using cudaHostAlloc and copies the 64MB block of memory 1000 times in a loop using cudaMemcpy calls. The copy is done 3 different ways for comparison: device->device, host->device and host->host.
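A minimal sketch of that setup might look like the following. This is not the exact benchmark code; the buffer size, loop count, and variable names are assumptions based on the description above:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define BUF  (64 * 1024 * 1024)  /* 64MB per copy, as described above */
#define LOOP 1000

int main(void) {
    char *hst_buffer, *dev_buffer;

    /* Pinned (page-locked) host memory lets the GPU's DMA engine
     * read the buffer directly, without an intermediate staging copy. */
    cudaHostAlloc((void **)&hst_buffer, BUF, cudaHostAllocDefault);
    cudaMalloc((void **)&dev_buffer, BUF);

    /* Host-to-device variant; the other runs swap the copy kind
     * or use plain memcpy for host-to-host. */
    for (size_t i = 0; i < LOOP; i++)
        cudaMemcpy(dev_buffer, hst_buffer, BUF, cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();
    cudaFreeHost(hst_buffer);
    cudaFree(dev_buffer);
    return 0;
}
```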

The loop that calls cudaMemcpy (or memcpy for the host-to-host copies) only iterates 1000 times, so that alone should not account for 100% utilization of a core, except in the host-to-host case, where I would expect the CPU to be pegged to the max. I just want to verify that what I am seeing is expected and that I’m not doing something wrong. My application is very bandwidth-intensive and I’m looking to offload as much as possible from the host CPUs, including the copies themselves.

Have you called cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)? Otherwise, spinning while waiting for completion will result in 100% CPU use.

I did not, but once I set cudaDeviceScheduleBlockingSync, the CPU usage drops to 0% and the bandwidth remains the same.
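For anyone else hitting this: the flag has to be set before the CUDA context is created, i.e. before the first runtime call that touches the device. A hedged sketch of where the call goes:

```cuda
#include <cuda_runtime.h>

int main(void) {
    /* Must come before any call that implicitly creates the context
     * (cudaMalloc, cudaMemcpy, a kernel launch, ...). With this flag,
     * the host thread blocks on synchronization instead of spin-waiting
     * while the DMA transfer completes. */
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    /* ... allocate pinned buffers and run the copy loop as before ... */
    return 0;
}
```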

That’s exactly the kind of thing I was hoping I might be doing wrong.

I’m a happy camper now.