4 byte transfers from device to host are extremely slow.


I don’t know whether this is an artifact of the system I’m using, or is otherwise to be expected, but I thought I’d post some findings regarding device->host transfers.

After profiling my code (CPU-side) and learning that cuMemcpyDtoH was responsible for about 97% of host execution time, I decided to take a look at just how long the transfers were taking.

I did some timings on device to host transfers by instrumenting my code as follows:


double currTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;  /* tv_usec is microseconds */
}

void copyDeviceToHost(void *tgt, CUdeviceptr src, size_t sz) {
    double t1, t2;
    t1 = currTime();
    failOnCUDAErr(cuMemcpyDtoH(tgt, src, sz));
    t2 = currTime();
    log("d->h copy: %zu bytes, %f sec\n", sz, t2 - t1);
}


I was expecting that small transfers would be somewhat inefficient, but the results were actually kind of surprising. The blue points in the attached graph are transfers of >4 bytes in size, whereas the red points are 4 byte transfers.

4 byte transfers have a median execution time of 0.22918 sec (!), whereas the median execution time of all >4 byte transfers is 0.00013 sec.

This is on a macbook pro using the inbuilt 9400M:


Model Name: MacBook Pro

Model Identifier: MacBookPro5,1

Processor Name: Intel Core 2 Duo

Processor Speed: 2.8 GHz

Number Of Processors: 1

Total Number Of Cores: 2

L2 Cache: 6 MB

Memory: 4 GB

Bus Speed: 1.07 GHz

Boot ROM Version: MBP51.0074.B01

SMC Version: 1.33f8

NVIDIA GeForce 9400M:

Chipset Model: NVIDIA GeForce 9400M

Type: Display

Bus: PCI

VRAM (Total): 256 MB

Vendor: NVIDIA (0x10de)

Device ID: 0x0863

Revision ID: 0x00b1

ROM Revision: 3343

gMux Version: 1.7.3


Further testing revealed that there's a large initial setup cost associated with the first device to host transfer following a kernel execution, regardless of the size of the transfer. This doesn't seem to be the case for host to device transfers (pinned memory is used in both cases).

Is there any way to avoid this? Alternatively, is there any way to quickly return a single 4 byte value from a kernel to the CPU?


The first CUDA call (usually cudaMalloc) always has an initialization overhead. (It's documented in the guide; check out "cuInit".)

No way to avoid it (if this is what you are facing).

I suspect now that what’s happening is that the time taken for the computation is being hidden by an implicit synchronization done by the first cuMemcpyDtoH(). So on that basis the time taken to fetch data from the device is actually small, and I should apologise for the above. Sorry. :)


Use “cudaThreadSynchronize()” to be doubly sure and then you will know what “cudaMemcpy” is costing you. Good Luck!
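To illustrate the suggestion above: bracketing the copy with an explicit synchronize separates the kernel's execution time from the actual transfer time. A minimal sketch using the driver API (to match the code earlier in the thread; currTime() and failOnCUDAErr() are assumed from that post, and cuCtxSynchronize is the driver-API equivalent of cudaThreadSynchronize):

```c
/* Sketch: separate kernel cost from copy cost.
   Assumes currTime() and failOnCUDAErr() from the earlier post. */
void timedCopyDeviceToHost(void *tgt, CUdeviceptr src, size_t sz) {
    double t0, t1, t2;

    t0 = currTime();
    failOnCUDAErr(cuCtxSynchronize());  /* drain the pending kernel */
    t1 = currTime();
    failOnCUDAErr(cuMemcpyDtoH(tgt, src, sz));
    t2 = currTime();

    log("kernel drain: %f sec, d->h copy: %f sec\n", t1 - t0, t2 - t1);
}
```

If the original 0.22918 sec was really hidden kernel time, it should now show up in the "kernel drain" figure and the copy itself should be cheap.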

Also, there was a thread posted by "Gregory Diamos" that talks about efficient cudaMemcpys (how to structure data to get maximum efficiency, etc.).

Gregory and his team have done a lot of research (Georgia Tech), and they were among the first to develop a PTX emulator. So you can definitely count on their findings.

Check it out, if you would like to.

Do you have a link? Neither Google nor the forum search was successful…

Thanks in advance.

Uh, if you’re using a 9400M, why are you doing memcpys at all? Just use zero-copy/mapped memory and all of that to have exactly zero memcpy overhead.
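For the 9400M case specifically (an integrated GPU sharing system memory), a zero-copy setup might look something like the sketch below. This uses the driver API to match the code earlier in the thread; it assumes the context was created with the CU_CTX_MAP_HOST flag and reuses the hypothetical failOnCUDAErr() from the original post:

```c
/* Sketch: mapped (zero-copy) memory with the driver API.
   Assumes a context created with CU_CTX_MAP_HOST, and
   failOnCUDAErr() from the earlier post. */
unsigned int *host_result;
CUdeviceptr dev_result;

/* Allocate pinned host memory that is also mapped into the
   device address space. */
failOnCUDAErr(cuMemHostAlloc((void **)&host_result, sizeof(unsigned int),
                             CU_MEMHOSTALLOC_DEVICEMAP));
failOnCUDAErr(cuMemHostGetDevicePointer(&dev_result, host_result, 0));

/* ... launch the kernel with dev_result as a parameter, so it
   writes its 4-byte result there directly ... */

failOnCUDAErr(cuCtxSynchronize());  /* ensure the write has landed */
/* *host_result now holds the value; no cuMemcpyDtoH needed. */
```

Since the 9400M has no dedicated VRAM path to cross, this should make returning a single 4-byte result as cheap as the synchronize itself.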



That's the link.

Well, now I see it is related to intra-GPU memcpy and has nothing to do with "cudaMemcpy".
Nonetheless, this should be interesting to you. Sorry.

The page in the link above has a link to "dubinsky's results", which actually deals with "cudaMemcpy". It should make an interesting read.
Otherwise, Gregory's templated code looks to be redundant: one can just get away with a void pointer, which actually serves as a better general-purpose "template" for pointers.