I’m not sure; as far as I have seen, NVidia uses DMA for all transfers between the card and the host. Using this instruction would make transfers synchronous and I don’t think it’d be faster than DMA.
Well, if 800 MB/sec for OpenGL readback and 2,500 MB/sec in CUDA (with my 8800 GTX) is the best they can do with DMA (assuming that is DMA at all), then yes, this would be faster, because it would be very close to the theoretical bandwidth limit of the Front Side Bus: a 1066 MT/s FSB moving 8 bytes per transfer works out to roughly 8.5 GB/sec.
It would be nice if someone from NVIDIA could chime in and say “possible / not possible”.
And whatever bandwidth is needed to write things back to memory. Presuming this fancy new instruction could max out the PCIe 2.0 bandwidth of 8 GB/s, that data is still coming over the FSB into the CPU and then going back out to memory over the FSB, so the FSB is going to limit it. DMA is better because it doesn’t involve the CPU. And with the right mobo, you can get 3.2 GB/s transfer to/from the card, which is as close to the theoretical peak as you can get given the overhead in the PCIe transfers (info from a post by Mark Harris that I’m too lazy to link to).
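To make the DMA point concrete, here is a minimal sketch (my own assumption of the usual runtime-API pattern, not anything official from NVIDIA) of a pinned-memory transfer where the GPU’s DMA engine moves the data while the CPU stays free; the buffer size and stream usage are purely illustrative:

```cpp
// Rough sketch: page-locked host memory lets the GPU's DMA engine move the data
// while the CPU does something else, instead of the CPU shuffling bytes over the FSB twice.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 64 << 20;              // 64 MB test buffer (arbitrary)
    unsigned char *h_pinned = 0, *d_buf = 0;

    cudaMallocHost((void**)&h_pinned, bytes);   // pinned host memory, DMA-able
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue the copy; the DMA engine handles it and the CPU returns immediately.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // ... CPU work could overlap with the transfer here ...
    cudaStreamSynchronize(stream);

    printf("queued and completed a %zu-byte host->device DMA copy\n", bytes);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```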
Huh, well, I get 3.1 GiB/s all the time on my system (an ancient three-year-old Dell). tachyon_john is using an NVIDIA mainboard with LinkBoost, which overclocks the PCIe lanes, and achieves over 4 GiB/s IIRC.
Edit - Link: http://forums.nvidia.com/index.php?showtop…39&hl=linkboost (I guess LinkBoost only got 3900 MB/s). Anyway, just search the forums; a lot of people report ~3.1 GiB/s, though most systems do seem to average around 2.5 GiB/s.
I’m still waiting to see someone post benchmarks with the latest PCIe 2.0 mobos and matching G92 cards … anyone?
The new instructions might improve the performance of transfers using pageable host memory.
They will not improve the performance of host memory allocated with cuMemAllocHost, but apps that use cuMemAllocHost might benefit from doing streaming loads from the buffer written by a device->host memcpy.
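To show what that suggestion could look like in practice, here is a rough sketch, assuming the "streaming loads" in question are SSE4.1’s MOVNTDQA via _mm_stream_load_si128 (my assumption, not confirmed in this thread); the buffer name and size are illustrative:

```cpp
// Sketch only: reading a pinned device->host buffer with SSE4.1 streaming loads
// (MOVNTDQA). Assumes SSE4.1 hardware; the buffer is 16-byte aligned (cudaMallocHost
// gives at least that) and a multiple of 16 bytes.
#include <cuda_runtime.h>
#include <smmintrin.h>      // _mm_stream_load_si128
#include <cstdio>

int main()
{
    const size_t bytes = 16 << 20;              // 16 MB (arbitrary)
    unsigned char *h_buf = 0, *d_buf = 0;

    cudaMallocHost((void**)&h_buf, bytes);      // pinned host buffer
    cudaMalloc((void**)&d_buf, bytes);
    cudaMemset(d_buf, 0x5a, bytes);

    // Device -> host copy into the pinned buffer.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    // Walk the buffer with streaming loads instead of ordinary cached loads.
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        acc = _mm_xor_si128(acc, _mm_stream_load_si128((__m128i*)(h_buf + i)));

    // Touch the result so the loop isn't optimized away.
    printf("checksum: %d\n", _mm_cvtsi128_si32(acc));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Built with something like nvcc -Xcompiler -msse4.1. On ordinary write-back memory MOVNTDQA behaves like a plain load, so whether this actually helps depends on how the driver maps the buffer, which is exactly what is being speculated about here.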
How that may be used by NVIDIA I don’t know, but the gains are certainly worthwhile.
As for others boasting 3.2 GB/sec, that is host-to-device transfer. What I am more interested in is device-to-host transfer, especially in OpenGL, which seems to be a lot slower for some reason. That is also where I believe this instruction might prove useful, but maybe I am wrong; again, that is why I asked in the first place.
Any ideas how an Intel X38 based DDR2 system (the Sun Ultra 24) can be so much faster than an NVIDIA nForce 780i based DDR2 system? Don’t the GeForce and nForce like each other? ;-)
I have essentially the same questions. With a 780i, an 8800 GT, and SLI-ready memory (read: PCIe 2.0 and a G92 core), I see speeds similar to a 680i / 8800 GTX for host->device and device->host per the bandwidth test, just a 10% improvement. Should I be looking at the Intel chipset for CUDA applications?
Is there some tuning / programming tool available for the 780i?
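For anyone wanting to sanity-check their own board outside the SDK’s bandwidthTest, a minimal timing sketch along these lines (buffer size and repeat count are arbitrary picks of mine) is enough to compare chipsets; swap the copy direction to measure host->device as well:

```cpp
// Quick-and-dirty version of what bandwidthTest does for pinned device->host copies:
// time a batch of cudaMemcpy calls with CUDA events and report MB/s.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 32 << 20;    // 32 MB per copy (arbitrary)
    const int    reps  = 20;

    unsigned char *h_buf = 0, *d_buf = 0;
    cudaMallocHost((void**)&h_buf, bytes);     // pinned, so the DMA path is used
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double mb = (double)bytes * reps / (1024.0 * 1024.0);
    printf("device->host (pinned): %.1f MB/s\n", mb / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```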
After replacing my NVIDIA 780i motherboard (Asus P5N-T Deluxe) with an Intel X38 board (Asus P5E WS PRO), pinned host<->device bandwidth increased from ~3200 to ~4600 MB/s (both directions), and memory bandwidth measured by memtest86+ increased from 3749 to 4479 MB/s. All other hardware was the same, running default settings and the most recent BIOS in both cases.