Transfer data from host to device: transfer 10G

What is the MAX bandwidth I can get with GTX 295?
Can I improve it with external device (BUS, DMA, Special memory…)?
I ask because I need to transfer 10G and to do Matrix-Vector Multiplication.

You can use bandwidthTest.exe from the SDK examples to measure it.

On my platform (GTX 295):

host to device: 1.1 GB/s

device to host: 1.7 GB/s

You need a mainboard with PCIe 2.0.

Then you have a maximum data transfer rate of about 5000 MB/s (PC to GPU) and 5000 MB/s (GPU to PC). PCIe is a bidirectional bus, so you can read and write data at the same time.

You can also buy a second GPU. Then you can copy the first picture to the first GPU and the second picture to the second GPU, so in total you can copy 10000 MB/s.
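To put the figures quoted so far side by side, here is a minimal sketch of the transfer-time arithmetic (time = size / bandwidth). It assumes "10G" means 10 GB, that 1 GB = 1000 MB, and that each rate is sustained for the whole copy; the function name `transfer_seconds` is made up for illustration.

```python
# Rough transfer-time estimate: time = size / bandwidth.
# Assumptions: "10G" means 10 GB, 1 GB = 1000 MB, and the rate is sustained.

def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds needed to move size_gb at a sustained bandwidth_gb_per_s."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


SIZE_GB = 10.0  # data set size from the original question

# Bandwidth figures quoted in this thread, in GB/s.
rates = {
    "measured host to device (bandwidthTest)": 1.1,
    "claimed PCIe 2.0, one GPU": 5.0,
    "claimed PCIe 2.0, two GPUs": 10.0,
}

for label, rate in rates.items():
    print(f"{label}: {transfer_seconds(SIZE_GB, rate):.2f} s")
```

At the measured 1.1 GB/s the copy alone takes roughly nine seconds, so the achievable bandwidth matters far more here than per-call overhead.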


Do you know how to get more bandwidth (maybe a special PCIe)?



You’ll always be limited by the PCIe of the board, I guess, since even if you have a very fast external device you need to connect it to the miserable x16 lanes you have on the motherboard.

Does your calculation take a lot of time? If so, you can try to move half the data, run the kernel async, copy the other half, and then run another async kernel on the second half. That way you can save time by doing both things at once (calculating and transferring data).
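The saving from this kind of overlap can be estimated with a simple timing model. The sketch below is not CUDA code; it only models the schedule, assuming the work splits into equal independent chunks, copies run one after another, each chunk's kernel can run while the next chunk is being copied, and fixed launch overheads are ignored. The function names are hypothetical, and the example numbers are the 95.5 ms transfer and 26.6 ms kernel reported later in this thread.

```python
# Timing model for overlapping host-to-device copies with kernel execution.
# Assumptions: equal independent chunks, copies are serialized, a chunk's
# kernel may overlap the next chunk's copy, no fixed per-launch overhead.

def serial_ms(transfer_ms: float, kernel_ms: float) -> float:
    """Copy everything, then compute everything."""
    return transfer_ms + kernel_ms


def pipelined_ms(transfer_ms: float, kernel_ms: float, chunks: int) -> float:
    """Total time when the work is split into `chunks` overlapped pieces."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    t = transfer_ms / chunks  # copy time per chunk
    k = kernel_ms / chunks    # kernel time per chunk
    # The first copy and the last kernel cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return t + k + (chunks - 1) * max(t, k)


if __name__ == "__main__":
    transfer, kernel = 95.5, 26.6  # ms, figures reported in this thread
    print(f"serial:   {serial_ms(transfer, kernel):.2f} ms")
    for n in (2, 4):
        print(f"{n} chunks: {pipelined_ms(transfer, kernel, n):.2f} ms")
```

With those numbers, two halves come out at about 108.8 ms against 122.1 ms serial. When the transfer dominates, the total can never drop below the transfer time itself, so overlap helps only modestly.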

I think someone here referred to Nehalem as the fastest available machine, PCI-wise.


PCIe (2.0) is a standardized interface, and graphics hardware is manufactured to this standard.

There is no “special” version.

You can overclock your system, but I think a stable and reliable system is more desirable…

I ask because I need to transfer 10G and to do Matrix-Vector Multiplication.

host to device: 1.1 GB/s

10G Host->Device / 5G (result back) Device->Host

OK, it takes some time, but what’s the problem? :) You can’t calculate every problem in real time :))

I need to get better results than CPU.

What/Who is Nehalem?

Nehalem is the new generation of Intel chips.

Look here at my (stupid) suggestion and what tmurray said:…40&start=40

posts #44 - 47

how about the other question? how long does your kernel take?


I am talking about 268 MB, and I run on a Quadro FX 1700.

The transfer to the GPU takes 95.5 ms -> 2.8 GB/s.

The kernel takes 26.6 ms.

I want to buy a GTX 295, so I believe I can get:

Transfer (at 5 GB/s) -> 53.48 ms.

Kernel (30x faster) -> 0.88 ms.

What do you think?
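The projection above can be checked with a few lines of arithmetic. This is only a sketch: the 5 GB/s rate and the 30x kernel speedup are the poster's assumptions, not measurements, and 1 GB is taken as 1000 MB. The variable and function names are made up for illustration.

```python
# Check of the projected GTX 295 numbers against the Quadro FX 1700 timings.
# Assumptions: 1 GB = 1000 MB; 5 GB/s and the 30x speedup are hoped-for values.

DATA_MB = 268.0
MEASURED_TRANSFER_MS = 95.5
MEASURED_KERNEL_MS = 26.6

ASSUMED_BANDWIDTH_GB_PER_S = 5.0  # hoped-for PCIe rate
ASSUMED_KERNEL_SPEEDUP = 30.0     # hoped-for kernel speedup


def bandwidth_gb_per_s(size_mb: float, time_ms: float) -> float:
    """MB per ms equals GB per s when 1 GB = 1000 MB."""
    return size_mb / time_ms


measured_bw = bandwidth_gb_per_s(DATA_MB, MEASURED_TRANSFER_MS)
projected_transfer_ms = DATA_MB / ASSUMED_BANDWIDTH_GB_PER_S
projected_kernel_ms = MEASURED_KERNEL_MS / ASSUMED_KERNEL_SPEEDUP

print(f"measured bandwidth:  {measured_bw:.2f} GB/s")
print(f"projected transfer:  {projected_transfer_ms:.2f} ms")
print(f"projected kernel:    {projected_kernel_ms:.2f} ms")
print(f"transfer/kernel now: {MEASURED_TRANSFER_MS / MEASURED_KERNEL_MS:.1f}x")
print(f"transfer/kernel new: {projected_transfer_ms / projected_kernel_ms:.1f}x")
```

Working from the raw 268 MB gives 53.6 ms rather than 53.48 ms (the latter scales from the rounded 2.8 GB/s), a negligible difference. The more important result is that the transfer would go from about 3.6 times the kernel time to about 60 times, so the copy would dominate almost completely.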


Well, 5 GB/s is probably optimistic, but in any case the ratio of transfer time to kernel time is still very high.

Is there any way you can avoid copying some of the data to the GPU? For example, I had three arrays, one of which was the sqrt of another, and I found that it was better to calculate it on the fly than to pass it as input to the GPU.

Maybe if you can elaborate more on what data you move and what your kernel does, someone will have an idea as to how to improve this ratio…

edit: BTW - did you try pinned memory?


The FX1700 is a PCI-e gen2 card.
You will not see faster PCI-e transfer using the GeForce.

PCI-e transfer speed depends on the motherboard/chipset.

Intel’s upcoming Lynnfield CPUs will have PCIe 2.0 integrated in the CPU itself and might therefore achieve higher bandwidth and certainly lower latencies.

Lynnfield Overview @ anandtech

Do you know how much bandwidth we can get with this chip and GTX295?

It hasn’t been released yet. There are no performance numbers. But it will be less than 6.4 GB/s because that is all that is practically achievable under PCI-e 2.0 with 16 lanes.
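For context on that ceiling, the raw PCIe 2.0 rate can be worked out from the link parameters: 5 GT/s per lane with 8b/10b encoding, per direction. The sketch below does that arithmetic; the function name is hypothetical, and reading the 6.4 GB/s figure as raw rate minus protocol overhead is an interpretation, not something stated in the post.

```python
# Raw PCIe 2.0 bandwidth per direction from the link parameters.
# Assumption: 1 GB = 10**9 bytes; packet and protocol overhead is not modeled.

GT_PER_S_PER_LANE = 5.0       # PCIe 2.0 signalling rate per lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 data bits per 10 bits on the wire
BITS_PER_BYTE = 8


def raw_bandwidth_gb_per_s(lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s for a PCIe 2.0 link."""
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes / BITS_PER_BYTE


x16 = raw_bandwidth_gb_per_s(16)
x8 = raw_bandwidth_gb_per_s(8)
quoted_practical = 6.4  # GB/s, figure quoted in the post above

print(f"x16 raw: {x16:.1f} GB/s")
print(f"x8 raw:  {x8:.1f} GB/s")
print(f"quoted practical limit is {quoted_practical / x16:.0%} of x16 raw")
```

The x16 link works out to 8 GB/s raw per direction, so the quoted 6.4 GB/s is 80% of that, and an x8 link tops out at 4 GB/s raw.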

Technical point on the GTX295… don’t the cards split the bus, so that each only really has x8 bandwidth? When I had a GX2, I think I found something like this was happening.

The GTX295 switches the bus, so you can get close to the full available PCI-e bandwidth to one of the GPUs if the other one is idle. With both going, it will be reduced to slightly less than half per GPU.

OK - that’s good to know. The GTX295 is like one cable from an S1070 :)


I want to buy a new machine (~$2200).

I want the GTX 295.

I understand that there are several kinds of PCIe x16.

Which of them do you suggest?