Data transfer speed between G80 and main memory

I just got my new machine (a Dell XPS) with a G80 GPU. While learning CUDA, I wrote a small program to test the data transfer speed between CPU main memory and the graphics card over the PCIe x16 bus. The PCIe x16 specification is 4 GB/s, but my program indicates the speed on my machine is only 1.5 GB/s, computed by measuring the CPU time of cudaMemcpy on 200 MB of data, host->device and device->host. Is that normal, or a little slow? Thanks.
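For reference, a minimal sketch of how such a measurement might look with the CUDA runtime API (the buffer size matches the post; everything else is illustrative, not the poster's actual code):

```c
/* Sketch of a host->device bandwidth test. A blocking cudaMemcpy from
   pageable memory is synchronous, so wall-clock timing around the call
   brackets the whole transfer. Requires the CUDA toolkit to build. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t bytes = 200u << 20;        /* 200 MB, as in the post */
    char *host = (char *)malloc(bytes);     /* pageable host memory */
    char *dev;
    cudaMalloc((void **)&dev, bytes);

    /* Warm up once so driver/context initialization is not timed. */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    clock_t t0 = clock();
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("host->device: %.2f GB/s\n", bytes / sec / 1e9);

    cudaFree(dev);
    free(host);
    return 0;
}
```

The same timing applied to a `cudaMemcpyDeviceToHost` call gives the readback number.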


It is extremely difficult, if not impossible, to get that level of performance on download/readback with OpenGL or DirectX. For example, the download speed measured by gpubench is ~800 MB/sec: …s/8800GTX-0003/

The CUDA programming manual mentions observed peaks of 2 GB/sec, so you are already pretty close. eelsen is right to point out that downloads through the graphics APIs are even slower. IMHO, backed by my personal findings, the 4 GB/sec figure is pure marketing, derived from theoretical considerations. It will not occur in practice.


eelsen, prkipfer, Thank you for your feedback. I will go ahead and do the real CUDA programming.

I observed 2.5 GB/s (host->device) on my G80 GTX with pinned memory mode.

Check out the optimized data transfer sample in the new SDK.


Yes, using the special memory allocation, I can indeed now get more than 3 GB/sec. Very cool. I wonder why this wasn’t possible with the graphics libs before.


This requires allocating non-pageable pinned system memory. The GPU can DMA from this memory. Thus, if you can create your data in this memory, you only need to DMA to the GPU. If you don’t, the driver has to memcpy from your array to its pinned memory (possibly in chunks), and then DMA. Therefore most transfers to the GPU are limited by CPU and chipset performance in addition to PCI-e performance.
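The two paths described above can be sketched as follows (a minimal illustration, not the SDK sample; the function and parameter names are my own):

```c
/* Sketch: the same copy from pageable vs. pinned host memory.
   cudaMallocHost() returns page-locked memory the GPU can DMA from
   directly; malloc()'d memory forces the driver to stage chunks
   through its own pinned buffer first. */
#include <cuda_runtime.h>
#include <stdlib.h>

void copy_both_ways(size_t bytes, void *dev)
{
    void *pageable = malloc(bytes);
    void *pinned;
    cudaMallocHost(&pinned, bytes);   /* page-locked allocation */

    /* Staged: driver memcpy()s into its pinned buffer, then DMAs. */
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    /* Direct: one DMA straight from the user's pinned buffer. */
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
}
```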

HOWEVER, if you allocate too much pinned memory, you can bring your system to its knees. That is why the graphics APIs don’t expose this sort of allocation for graphics data structures.

When you use pinned memory, you do so at your own (and your users’ own) risk. On fixed platforms (embedded systems, clusters, etc.), I expect pinned memory to be very useful because you can experiment to figure out how much is safe to use.

On desktop applications, you should do extensive testing to figure out what works on a variety of PC configurations.


Thanks Mark. I see, makes a lot of sense for graphics APIs.

Is there a standard approach to find out how much pinned memory is free/currently used by system/how much there is at all? On Linux my first guess would be to inspect /proc/mtrr and /proc/meminfo. Has anyone more info on this topic?


It’s difficult to document rules of thumb for developers to follow. According to some references, non-pageable memory is a scarce resource, yet theoretically you can allocate all of system memory as page-locked. We have done some basic testing on this. For example, one directed test measured 3DMark06 CPU performance degradation with different amounts of memory pinned with cuMemAllocHost(). The benchmark scores didn’t drop noticeably until we had locked down more than half of physical memory! And we were able to pin 3/4 of physical memory before the allocations started failing; normal tasks like Web browsing were noticeably slower at that point. We suspect that if we had timed more memory-intensive tasks (for example, full builds of large software projects), the degradation would have shown up sooner.
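A directed test like the one described could be sketched as below (the chunk size is arbitrary; run this only on a machine you can afford to make unresponsive, for exactly the reasons discussed in this thread):

```c
/* Sketch: keep allocating pinned memory in 256 MB chunks until
   cudaMallocHost() fails, then report how much the system allowed.
   The chunks are deliberately not freed until process exit, so that
   they all stay pinned at once. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t chunk = 256u << 20;   /* 256 MB per allocation */
    size_t total = 0;
    void *p;

    while (cudaMallocHost(&p, chunk) == cudaSuccess)
        total += chunk;

    printf("pinned %zu MB before allocation failed\n", total >> 20);
    return 0;   /* pinned chunks are released when the process exits */
}
```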

This is all very system dependent, which is why we have to be vague in the documentation. Our best advice is to proceed with caution, make directed tests, and be conservative if your app will be deployed on a wide variety of systems.


where is that?



I have an MSI P6N Diamond motherboard (nForce 680 SLI chipset) with a northbridge and a southbridge.

I have plugged in 3 cards:

on the northbridge:
PCI-E x8 --> 1.5 GB/sec throughput in each direction
PCI-E x8 --> 1.5 GB/sec throughput in each direction

on the southbridge:
PCI-E x16 --> HostToDevice 750 MB/sec
DeviceToHost 325 MB/sec

Can somebody explain that? Is the southbridge slow? Do I have to clock it higher in the BIOS? And why is the PCI-E x16 slot on the southbridge, then?


Just a comment about units. The GB/s in the above specification is Gigabytes/s.

However, the GB/s in NVIDIA’s specification of memory bandwidth is Gigabits/s. Thus, for example, the 8800 GTX has a memory bandwidth of 86.4 Gigabits/s = 86.4/8 = 10.8 Gigabytes/s, which is only 2.7 times faster than that of PCIe. Am I right?

No, it is 86.4 GB/s (Gigabytes/s).

384 bits (width of the interface) / 8 (bits per byte) * 1.8 GHz (effective memory clock) = 86.4 GB/s

Can this be achieved in practice? The best I have gotten, or heard of, is 70 GB/s.

If you use overclocked cards like the GeForce Ultra from XFX, you can theoretically get 104 GB/sec and reach about 85 GB/sec of copy throughput. In my system they run stable.

Yes, but that is theoretical.
Results in practice are well below those values.

My experience has shown that cudaMallocHost can fail even when posix_memalign+mlock or mmap(ANONYMOUS|LOCKED) will work. Does anyone know why? If I recall correctly, this usually occurs with buffers over 256 MB.