Data transfer speed between G80 and main memory

imagination · March 22, 2007, 3:31pm

Hi,
I just got my new machine (Dell XPS) with a G80 GPU. While learning CUDA, I wrote a small program to test the data transfer speed between CPU main memory and the graphics card over the PCIe x16 bus. The PCIe x16 specification is 4 GB/s, but my program indicates that the speed on my machine is only 1.5 GB/s, computed by measuring the CPU time of “cudaMemcpy” calls on 200M of data, host->device and device->host. Is this normal or a little slow? Thanks.

Jeremy
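
For reference, the two numbers being compared in this post can be reproduced with a little arithmetic. The sketch below assumes a first-generation PCIe link (2.5 GT/s per lane with 8b/10b encoding), which is where the 4 GB/s per-direction figure for x16 comes from, and computes effective bandwidth as bytes moved divided by elapsed time. The function names and the 0.14 s timing are hypothetical, and "200M" is assumed here to mean 200 MiB.

```python
# Hypothetical helpers for illustration; these names are not part of CUDA.

def pcie_gen1_peak_gb_per_s(lanes):
    """Theoretical one-direction data rate of a PCIe 1.x link, in GB/s."""
    raw_gbit_per_lane = 2.5        # 2.5 GT/s per lane, one bit per transfer
    encoding_efficiency = 8 / 10   # 8b/10b line coding: 8 data bits per 10 sent
    return lanes * raw_gbit_per_lane * encoding_efficiency / 8  # bits -> bytes


def effective_gb_per_s(num_bytes, seconds):
    """Measured bandwidth: bytes transferred divided by elapsed time."""
    return num_bytes / seconds / 1e9


if __name__ == "__main__":
    print(f"x16 theoretical peak: {pcie_gen1_peak_gb_per_s(16):.1f} GB/s")
    # Assumed example: 200 MiB copied in 0.14 s (made-up timing, not a measurement).
    print(f"measured: {effective_gb_per_s(200 * 2**20, 0.14):.2f} GB/s")
```

The theoretical figure ignores packet headers and other protocol overhead, so the attainable payload rate is lower even before host-side costs are counted.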

eelsen · March 22, 2007, 5:18pm

It is extremely difficult, if not impossible, to get that level of performance on download/readback with OpenGL or DirectX. For example, the download used by gpubench is ~800 MB/sec.

http://graphics.stanford.edu/projects/gpub…s/8800GTX-0003/

prkipfer · March 22, 2007, 5:36pm

The CUDA programming manual talks about observed peaks of 2 GB/sec, so you are already pretty close. eelsen is right in pointing out that downloads through graphics APIs are even slower. IMHO, backed by my personal findings, the 4 GB/sec is pure marketing talk derived from theoretical considerations and will not occur in practice.

Peter

imagination · March 22, 2007, 6:22pm

eelsen, prkipfer, Thank you for your feedback. I will go ahead and do the real CUDA programming.

iceberg · April 29, 2007, 1:26am

I observed 2.5 GB/s (host->device) on my G80 GTX with pinned memory.

paulius · April 29, 2007, 3:03am

Check out the optimized data transfer sample in the new SDK.

Paulius

prkipfer · April 30, 2007, 12:50pm

Yes, using the special mem allocation, I indeed can now get more than 3 GB/sec. Very cool. Wonder why this wasn’t possible with the graphics libs before.

Peter

Mark_Harris · May 3, 2007, 10:43am

This requires allocating non-pageable pinned system memory. The GPU can DMA from this memory. Thus, if you can create your data in this memory, you only need to DMA to the GPU. If you don’t, the driver has to memcpy from your array to its pinned memory (possibly in chunks), and then DMA. Therefore most transfers to the GPU are limited by CPU and chipset performance in addition to PCI-e performance.

However, if you allocate too much pinned memory, you can bring your system to its knees. Therefore the graphics APIs don’t expose this sort of allocation for graphics data structures.

When you use pinned memory, you do so at your own (and your users’ own) risk. On fixed platforms (embedded systems, clusters, etc.), I expect pinned memory to be very useful because you can experiment to figure out how much is safe to use.

On desktop applications, you should do extensive testing to figure out what works on a variety of PC configurations.

Mark
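
A toy model of the staging path described in this post may help show why pageable transfers land well below the bus rate. It assumes the host-side memcpy into the driver's pinned buffer and the DMA each run at a fixed rate, and gives two bounds: stages run strictly back to back, and chunks perfectly pipelined. The rates are made-up inputs, not measurements, and the function names are hypothetical.

```python
# Toy model only: real drivers, chipsets, and chunk sizes behave less ideally.

def serialized_rate(memcpy_gb_s, dma_gb_s):
    """Lower bound: copy, then DMA, so the per-byte times add up."""
    return 1.0 / (1.0 / memcpy_gb_s + 1.0 / dma_gb_s)


def overlapped_rate(memcpy_gb_s, dma_gb_s):
    """Upper bound: chunks perfectly pipelined, the slower stage dominates."""
    return min(memcpy_gb_s, dma_gb_s)


if __name__ == "__main__":
    memcpy_rate, dma_rate = 3.0, 3.0   # assumed rates in GB/s
    print(f"serialized: {serialized_rate(memcpy_rate, dma_rate):.2f} GB/s")
    print(f"overlapped: {overlapped_rate(memcpy_rate, dma_rate):.2f} GB/s")
```

When the data already lives in pinned memory, the staging copy disappears and only the DMA term remains.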

prkipfer · May 3, 2007, 12:21pm

Thanks Mark. I see, makes a lot of sense for graphics APIs.

Is there a standard approach to find out how much pinned memory is free, how much is currently used by the system, and how much there is in total? On Linux my first guess would be to inspect /proc/mtrr and /proc/meminfo. Does anyone have more info on this topic?

Peter
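
As a starting point for the /proc/meminfo idea, here is a minimal sketch that parses the file and prints a few fields related to locked memory. It assumes a Linux system; `parse_meminfo` is a hypothetical helper name. Memory pinned by the GPU driver may not be reflected in the `Mlocked` or `Unevictable` counters, so treat the output as a rough indication only.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        for key in ("MemTotal", "MemFree", "Mlocked", "Unevictable"):
            print(key, info.get(key, "n/a"))
    else:
        print("/proc/meminfo not available on this system")
```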

Mark_Harris · May 4, 2007, 2:02pm

It’s difficult to document rules of thumb for developers to follow. According to some references, non-pageable memory is a scarce resource. Theoretically, you can allocate all of system memory as page-locked. We have done some basic testing on this. For example, one directed test measured 3DMark06 CPU performance degradation with different amounts of memory pinned with cuMemAllocHost(). The benchmark scores didn’t drop noticeably until we had locked down more than half of physical memory! And we were able to pin down 3/4 of physical memory before the allocations started failing. Normal tasks like Web browsing were noticeably slower at that point. We suspect that if we were timing more memory intensive tasks (for example full builds of large software projects), performance degradations would have shown up sooner.

This is all very system dependent, which is why we have to be vague in the documentation. Our best advice is to proceed with caution, make directed tests, and be conservative if your app will be deployed on a wide variety of systems.

Mark

jon_sanders · December 8, 2007, 9:58pm

where is that?

Jon

jtoelke · December 18, 2007, 11:29am

Help!

I have an MSI P6N Diamond motherboard with a northbridge and a southbridge (nFORCE 680 SLI chipset).

I have plugged in 3 cards:

On the northbridge:
PCIe x8 → 1.5 GB/sec throughput in each direction
PCIe x8 → 1.5 GB/sec throughput in each direction

On the southbridge:
PCIe x16 → HostToDevice 750 MB/sec, DeviceToHost 325 MB/sec

Can somebody explain that? Is the southbridge slow? Do I have to clock it higher in the BIOS? Why is the PCIe x16 slot on the southbridge, then?

Help!

nasacort · December 18, 2007, 9:13pm

Just a comment about units. The GB/s in the above specification is Gigabytes/s. However, the GB/s in NVIDIA’s specification of memory bandwidth is Gigabits/s (for example, http://www.nvidia.com/page/geforce8.html). Thus, the 8800GTX has a memory bandwidth of 86.4 Gigabits/s = 86.4/8 = 10.8 Gigabytes/s, which is only 2.7 times faster than that of PCIe. Am I right?

mfatica · December 18, 2007, 9:19pm

No, it is 86.4 GB/s (Gigabytes/s).

384 bits (width of the interface) / 8 (bits per byte) * 1.8 GHz (effective memory clock) = 86.4 GB/s
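
The same arithmetic as a short sketch, with the units spelled out. The 1.8 GHz figure is taken here to be the effective (double data rate) transfer rate, and the 4 GB/s PCIe figure is the theoretical x16 number quoted at the top of the thread. The function name is hypothetical.

```python
def memory_bandwidth_gb_per_s(bus_width_bits, effective_rate_gt_per_s):
    """Peak bandwidth = bytes per transfer * billions of transfers per second."""
    return bus_width_bits / 8 * effective_rate_gt_per_s


if __name__ == "__main__":
    gtx = memory_bandwidth_gb_per_s(384, 1.8)
    print(f"8800 GTX peak memory bandwidth: {gtx:.1f} GB/s")
    # Ratio to the theoretical 4 GB/s of a PCIe x16 link.
    print(f"ratio to PCIe x16: {gtx / 4.0:.1f}x")
```

So device memory is roughly twenty times faster than the PCIe link on paper, not 2.7 times.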

vvolkov · December 19, 2007, 4:03pm

Can this be achieved in practice? The best I have measured or heard of is 70 GB/s.

jtoelke · January 11, 2008, 12:40pm

If you use overclocked cards like the GeForce Ultra from XFX, you can theoretically get 104 GB/s and reach 85 GB/s copy throughput. They run stable in my system.

VanDammage · January 11, 2008, 1:58pm

Yes, but this is theoretical.
The results in practice are well below these values.

bbudge · January 26, 2008, 12:15am

My experience has shown that cudaMallocHost can fail even when posix_memalign+mlock or mmap(ANONYMOUS|LOCKED) will work. Does anyone know why? Usually this occurs with buffers over 256 MB, if I recall correctly.
