How many Gbit/s from GPU to RAM and back? Performance in Gbit/s


I’m new to the CUDA and Tesla concepts.

I would therefore like to know how many Gbit/s it is possible to transfer from the GPU (Tesla or other) to the RAM and back again, using all the newest available hardware.

Check the FAQ:
Programming question #16

The latest PCIe gen2 cards can attain 4-6 GiB/s depending on the mainboard.

Ok, thank you for the fast response. But I have another question.

Is it possible to use CUDA with regular SLI or three-way SLI to get more bandwidth?
And is the 4-6 GiB/s only one way or is it full duplex?

I am looking for a link of ca. 105 MByte/s between the GPU and RAM. Would this be possible with the newest hardware? If not, where do you think the bottlenecks would be?

SLI will not provide you with more bandwidth. CUDA doesn’t use SLI.

If you have more than one GPU in the system you can program each separately with CUDA. You could transfer to each GPU simultaneously, but then half of your data is on each GPU (which isn’t necessarily a problem as, depending on your algorithm, you may be able to operate on each half independently).

105 MiB/s is << 4 GiB/s, so you should have absolutely no problems unless you are making extremely small transfers (< 16 KiB per transfer).

Sorry, it is 105 GByte/s I need. So it is much bigger than the 4 GB/s…

The GTX 260 and 280 can transfer data this quickly between the GPU and the RAM on the graphics card, but not to system RAM. There is no PC workstation in existence with this kind of bandwidth to system memory.

Then you need a true supercomputer: regular x86 architectures can’t provide that much bandwidth (main RAM that fast does not even exist).

The fastest workstation I can think of uses DDR3-1600 RAM in dual channel. The theoretical max bandwidth is 25.6 GB/s, yet actual measured bandwidth does not reach even half that.

Plus, even the fastest workstation bus (PCI Express 2.0) has a theoretical max bandwidth of 16 GB/s.

So we’re very far from your needs.


And true supercomputers only attain this level of bandwidth through parallel IO. You have to decompose your problem into many small chunks each of which only needs a portion of the bandwidth. Each chunk is then on a separate node of a cluster using something like MPI.

Sorry guys, but I made another typo… I should have said 105 Gbit/s and not 105 GByte/s, which of course would be crazy…

You say that DDR3-1600 RAM in dual channel mode could give a bandwidth of about 25.6 GB/s, which would be 204.8 Gbit/s in theory. And the theoretical bandwidth of PCIe is 16 GB/s = 128 Gbit/s, so in theory I should be able to get the 105 Gbit/s, both ways (full duplex), that I’m looking for??

Again, thank you for your fast answers, and sorry for the two major typos…


As far as I know, those are theoretical maximum values for burst transfers.
Actual maximum transfer rates across PCI Express are much lower than 16 GB/s.
Based upon the benchmarks I’ve seen in the forum, I think the max real-world GPU<->main RAM bandwidth is about 3.1 GB/s (about 25 Gbit/s) from host to device and 2.2 GB/s from device to host (which has always been slower).
Maybe something will change with the new x86 platform from Intel in October/November (Nehalem, with the accompanying new chipsets), and maybe not.