SLI use with CUDA programming

I was reviewing the NVIDIA website and the stated specifications for the 8800 GTX read
“NVIDIA® SLI™ Technology:
Delivers up to 2x the performance of a single graphics card configuration for unequaled gaming experiences by allowing two graphics cards to run in parallel. The must-have feature for performance PCI Express® graphics, SLI dramatically scales performance on today’s hottest games.” at location http://www.nvidia.com/page/8800_features.html .

Similarly, for the Quadro FX 4600 and 5600:
“NVIDIA SLI Technology
NVIDIA® SLI™ technology enables dynamically
scalable graphics performance, enhanced image
quality, and expanded display real-estate.” at location http://www.nvidia.com/docs/IO/40049/quadro…0_datasheet.pdf

Further documentation at location http://www.nvidia.com/object/quadro_sli.html

“SLI Frame Rendering: Combines two identical NVIDIA Quadro PCI Express graphics cards with an SLI connector to transparently scale application performance on a single display by presenting them as a single graphics card to the operating system.”

Therefore, can I use SLI in conjunction with CUDA to have two identical cards on my machine (any 8800 or Quadro 5600 or 4600) and program 256 multiprocessors as though they were one GPU?

Please assume (somehow) that I can obtain the hardware that is compliant and has sufficient requirements to mount and run the two GPU cards.

SLI and CUDA are orthogonal concepts. The first is for automatic distribution of rasterization work; the second is for direct execution of code on the GPU. CUDA is not used for rendering (on- or offscreen). That is, when using CUDA you can simply enumerate all available cards in the machine and directly submit code to execute on each of them. This code has nothing to do with shader code - it is C-like. So you have a lot more control over what happens where and when.
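The enumeration described above can be sketched with the CUDA runtime API (a minimal sketch, not from the original post; error checking omitted for brevity):

```cuda
// Sketch: enumerate all CUDA devices visible to the process and
// print their multiprocessor count and compute capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d multiprocessors, compute %d.%d\n",
               i, prop.name, prop.multiProcessorCount,
               prop.major, prop.minor);
    }
    return 0;
}
```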

Peter

Thanks Peter. I understand it better now.

You cannot treat two 8800 cards as a single set of 256 processors. You can, however, treat them as two sets of 128 processors each (you’d need two host threads, each of which copies the necessary data and launches a kernel on its respective card). Similarly, you can take advantage of 3 cards. One reason is that the cards do not really share memory in SLI mode - shared data must be copied from one card to the other over the bus. So, if a “unified” view of the two SLI’ed cards were allowed, accessing different global memory addresses could have very different latencies.

Paulius

P.S. The 8800 GTX has 16 multiprocessors, each with 8 stream processors.
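The two-thread pattern Paulius describes can be sketched as follows. This is not from the original post and assumes a modern CUDA toolkit (`cudaSetDevice` and C++11 `std::thread` did not exist at the time of the original beta); the kernel and the even split across devices are illustrative:

```cuda
// Sketch: one host thread per GPU, each binding to its own device,
// copying its chunk of data, launching a kernel, and copying back.
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void workOnDevice(int dev, const float *host, float *out, int n) {
    cudaSetDevice(dev);                 // bind this host thread to one GPU
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(out, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    const int n = 1 << 20;
    std::vector<float> in(n, 1.0f), out(n);
    std::vector<std::thread> threads;
    // Split the array across all GPUs, one host thread per device.
    int chunk = n / count;
    for (int dev = 0; dev < count; ++dev)
        threads.emplace_back(workOnDevice, dev,
                             in.data() + dev * chunk,
                             out.data() + dev * chunk, chunk);
    for (auto &t : threads) t.join();
    return 0;
}
```

Each thread owns one device context, so the copies and kernel launches on the two cards proceed in parallel; the results only meet again in host memory.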

Can one ship data from one GPU to another without going through the host, or faster than going through the host?

Not in the current beta release of CUDA, but this is planned for a future release.

Mark

What sort of speeds (or speed-ups vis-à-vis PCI Express) are expected for data transfers on this channel?

Very old thread, but any updates on this? Can CUDA 6 use SLI to move data between devices without going through PCIe and CPU?

Looks like the answer is no. See the CUDA 6.0 programming manual and search for “peer”:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access
http://docs.nvidia.com/cuda/cuda-samples/index.html#new-cuda-code-samples-in-cuda-6-0

Do not use SLI in that case:
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-general-known-issues

“Peer access is disabled between two devices if either of them is in SLI mode.”
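For reference, the check the runtime performs can be sketched like this (not from the thread; a minimal sketch using the modern runtime API). With SLI enabled, the capability query between the two devices fails:

```cuda
// Sketch: query and enable peer-to-peer access between device 0 and
// device 1. If either device is in SLI mode, the query reports 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags argument must be 0
        // cudaMemcpyPeer() can now move data GPU-to-GPU directly,
        // without staging through host memory.
    } else {
        printf("Peer access 0 -> 1 not available (e.g. SLI enabled).\n");
    }
    return 0;
}
```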

Thanks!