4 byte transfers from device to host are extremely slow.


I don’t know whether this is an artifact of the system I’m using, or is otherwise to be expected, but I thought I’d post some findings regarding device->host transfers.

After profiling my code (CPU-side) and learning that cuMemcpyDtoH was responsible for about 97% of host execution time, I decided to take a look at just how long the transfers were taking.

I did some timings on device to host transfers by instrumenting my code as follows:


double currTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;  /* tv_usec is microseconds */
}

void copyDeviceToHost(void *tgt, CUdeviceptr src, size_t sz) {
    double t1, t2;
    t1 = currTime();
    failOnCUDAErr(cuMemcpyDtoH(tgt, src, sz));
    t2 = currTime();
    log("d->h copy: %zu bytes, %f sec\n", sz, t2 - t1);
}


I was expecting that small transfers would be somewhat inefficient, but the results were actually kind of surprising. The blue points in the attached graph are transfers of >4 bytes in size, whereas the red points are 4 byte transfers.

4 byte transfers have a median execution time of 0.22918 sec (!), whereas the median execution time of all >4 byte transfers is 0.00013 sec.

This is on a macbook pro using the inbuilt 9400M:


Model Name: MacBook Pro

Model Identifier: MacBookPro5,1

Processor Name: Intel Core 2 Duo

Processor Speed: 2.8 GHz

Number Of Processors: 1

Total Number Of Cores: 2

L2 Cache: 6 MB

Memory: 4 GB

Bus Speed: 1.07 GHz

Boot ROM Version: MBP51.0074.B01

SMC Version: 1.33f8

NVIDIA GeForce 9400M:

Chipset Model: NVIDIA GeForce 9400M

Type: Display

Bus: PCI

VRAM (Total): 256 MB

Vendor: NVIDIA (0x10de)

Device ID: 0x0863

Revision ID: 0x00b1

ROM Revision: 3343

gMux Version: 1.7.3


Further testing revealed that there's a large initial setup cost associated with the first device to host transfer following a kernel execution, regardless of the size of the transfer. This doesn't seem to be the case for host to device transfers (pinned memory is used in both cases).

Is there any way to avoid this? Alternatively, is there any way to quickly return a single 4 byte value from a kernel to the CPU?


The first CUDA call (usually cudaMalloc) always has an initialization overhead. (It's documented in the guide; check out "cuInit".)

No way to avoid it (if this is what you are facing).

I suspect now that what’s happening is that the time taken for the computation is being hidden by an implicit synchronization done by the first cuMemcpyDtoH(). So on that basis the time taken to fetch data from the device is actually small, and I should apologise for the above. Sorry. :)


Use “cudaThreadSynchronize()” to be doubly sure and then you will know what “cudaMemcpy” is costing you. Good Luck!
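To illustrate the suggestion above: bracketing the copy with an explicit synchronize separates the kernel's execution time from the actual transfer time. A minimal sketch using the driver API (to match the code earlier in the thread; currTime() and failOnCUDAErr() are assumed from that post, and cuCtxSynchronize is the driver-API equivalent of cudaThreadSynchronize):

```c
/* Sketch: separate kernel cost from copy cost.
   Assumes currTime() and failOnCUDAErr() from the earlier post. */
void timedCopyDeviceToHost(void *tgt, CUdeviceptr src, size_t sz) {
    double t0, t1, t2;

    t0 = currTime();
    failOnCUDAErr(cuCtxSynchronize());  /* drain the pending kernel */
    t1 = currTime();
    failOnCUDAErr(cuMemcpyDtoH(tgt, src, sz));
    t2 = currTime();

    log("kernel drain: %f sec, d->h copy: %f sec\n", t1 - t0, t2 - t1);
}
```

If the original 0.22918 sec was really hidden kernel time, it should now show up in the "kernel drain" figure and the copy itself should be cheap.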

Also, there was a thread posted by "Gregory Diamos" that talks about efficient cudaMemcpys (how to structure data to get maximum efficiency, etc.).

Gregory and his team have done a lot of research (Georgia Tech), and they were among the first to develop a PTX emulator. So you can definitely count on their findings.

Check it out, if you would like to.

Do you have a link? Neither Google nor the forum search was successful…

Thanks in advance.

Uh, if you’re using a 9400M, why are you doing memcpys at all? Just use zero-copy/mapped memory and all of that to have exactly zero memcpy overhead.
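For the 9400M case specifically (an integrated GPU sharing system memory), a zero-copy setup might look something like the sketch below. This uses the driver API to match the code earlier in the thread; it assumes the context was created with the CU_CTX_MAP_HOST flag and reuses the hypothetical failOnCUDAErr() from the original post:

```c
/* Sketch: mapped (zero-copy) memory with the driver API.
   Assumes a context created with CU_CTX_MAP_HOST, and
   failOnCUDAErr() from the earlier post. */
unsigned int *host_result;
CUdeviceptr dev_result;

/* Allocate pinned host memory that is also mapped into the
   device address space. */
failOnCUDAErr(cuMemHostAlloc((void **)&host_result, sizeof(unsigned int),
                             CU_MEMHOSTALLOC_DEVICEMAP));
failOnCUDAErr(cuMemHostGetDevicePointer(&dev_result, host_result, 0));

/* ... launch the kernel with dev_result as a parameter, so it
   writes its 4-byte result there directly ... */

failOnCUDAErr(cuCtxSynchronize());  /* ensure the write has landed */
/* *host_result now holds the value; no cuMemcpyDtoH needed. */
```

Since the 9400M has no dedicated VRAM path to cross, this should make returning a single 4-byte result as cheap as the synchronize itself.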



That's the link.

Well, now I see it is related to intra-GPU memcpy and has nothing to do with "cudaMemcpy".
Nonetheless, this should be interesting to you. Sorry.

The page in the link above has a link to "dubinsky's results", which actually deals with "cudaMemcpy". It should make an interesting read.
Otherwise, Gregory's templated code looks to be redundant: one can just get away with a void pointer, which actually serves as a better general-purpose "template" for pointers.