cudaMemcpy() speed/bandwidth for host to device

I've written a program that times cudaMemcpy() from host to device for an array of random floats. I've tried various transfer sizes (anywhere from 1 KB to 256 MB) and have only reached a maximum bandwidth of ~1.5 GB/s for non-pinned (pageable) host memory and ~3.0 GB/s for pinned host memory. I'm on an Intel Core 2 2.4 GHz machine with a GTX 285 and have a feeling that this bandwidth is quite low. I've seen claims of ~18 GB/s from sites like this ( – 3rd graph down – ), while others claim maximum speeds of 5 GB/s for host-to-device transfers. Could anybody give some clarification or tips regarding this? Thanks in advance.
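A simplified sketch of the kind of timing loop I'm using (this is illustrative, not my exact code; the transfer size and repetition count are placeholders):

```cuda
// Sketch of a host-to-device bandwidth timer. Compile with: nvcc -o bw bw.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;  // 32 MB transfer (placeholder size)
    float *h_pinned, *d_buf;

    // Pinned (page-locked) host memory enables faster DMA transfers
    cudaMallocHost(&h_pinned, bytes);
    cudaMalloc(&d_buf, bytes);

    // cudaEvent timers measure the elapsed time of the copies
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 100;  // average over many copies
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb_per_s = (double)bytes * reps / (ms / 1e3) / 1e9;
    printf("Host->Device (pinned): %.2f GB/s\n", gb_per_s);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```

Swapping cudaMallocHost/cudaFreeHost for plain malloc/free gives the pageable-memory numbers.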

Add on: When I run the bandwidth test that comes with the SDK, I get:

Running on…
device 0:GeForce GTX 285
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1475.6

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1390.8

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 129423.0

&&&& Test PASSED

and for --memory=pinned for host to device…

Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3161.7

So is my program correct? Do I have a bad CUDA driver/setup?



If you look carefully at the site you linked, the figure is 18 Gb/s (gigabits per second), which is 2.25 GB/s.

Regarding your performance, it looks decent for a PCIe v1 x16 slot, or for a v2 slot on some mainboards. It really comes down to the mainboard: different ones provide different PCIe transfer speeds. What mainboard/chipset do you have?

My own (an NVIDIA 780i chipset, which is PCIe v2 x16) also only gives about 3 GiB/s transfers to/from pinned memory. You'll find that most of the reports of 5-6 GB/s are on newer Core i7 boards, which boast massive amounts of memory bandwidth in general, although I do seem to recall some AMD chipsets reaching 5-6 GB/s when PCIe v2 first came out.

I see 4.7-5.2 GB/sec on my AMD 790FX motherboards w/ Phenom processors.

My Core 2 Quad gets about 5 GB/s to and from pinned memory with both cards - or at least it does now that I'm off the 185.18.08-beta driver, which was on the "Get CUDA" page for a very long time (I've now got the 185.18.14 driver). That driver cut my bandwidth in half - I'd suggest double-checking which driver you've got.

Yeah that was my fault for not noticing that.

Thanks all for clarifying, though. The key issue wasn't so much the speed itself, but whether the code I wrote was actually measuring the maximum bandwidth for my chipset/driver/setup (which it now seems to be). I'll be sure to check my drivers, though.

Thanks again.

My Dell Precision 670, which claims to have PCIe v2 x16, gives the following results:

Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2427.4

Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2691.6