I was reviewing the NVIDIA website, and the stated specifications for the 8800 GTX read:
“NVIDIA® SLI™ Technology:
Delivers up to 2x the performance of a single graphics card configuration for unequaled gaming experiences by allowing two graphics cards to run in parallel. The must-have feature for performance PCI Express® graphics, SLI dramatically scales performance on today’s hottest games.” (from http://www.nvidia.com/page/8800_features.html)
“SLI Frame Rendering: Combines two identical NVIDIA Quadro PCI Express graphics cards with an SLI connector to transparently scale application performance on a single display by presenting them as a single graphics card to the operating system.”
Therefore, can I use SLI in conjunction with CUDA to have two identical cards on my machine (any 8800 or Quadro 5600 or 4600) and program 256 multiprocessors as though they were one GPU?
Please assume that I can (somehow) obtain compliant hardware that meets the requirements to mount and run the two GPU cards.
SLI and CUDA are orthogonal concepts. SLI automatically distributes rasterization work across cards; CUDA executes code directly on the GPU. CUDA is not used for rendering (on- or offscreen). That is, when using CUDA you can simply list all available cards in the machine and submit code to each of them directly. This code has nothing to do with shader code; it is C-like. So you have much more control over what happens where and when.
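To illustrate "listing all available cards", here is a minimal sketch using the CUDA runtime API. It is written against a current toolkit rather than the release available when this thread was posted, so treat it as an illustration, not as what the original poster would have typed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime how many CUDA-capable devices are installed.
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Each card shows up as its own device, whether or not an SLI bridge is fitted.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices we cannot query
        }
        std::printf("Device %d: %s, %d multiprocessors, %zu bytes of global memory\n",
                    dev, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    }
    return 0;
}
```

Each index printed here is what you would later pass to `cudaSetDevice` to direct work at that particular card.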
You cannot treat two 8800 cards as a single set of 256 processors. You can, however, treat them as two sets of 128 processors each (you’d need two host threads, each of which copies the necessary data to its own card and launches a kernel there). Similarly, you can take advantage of three cards. One reason for this could be that the cards do not really share memory in SLI mode: shared data must be copied from one card to the other via the bus. So, if a “unified” view of the two SLI’ed cards were allowed, accesses to different global memory addresses could have very different latencies.
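The "one host thread per card" pattern could look roughly like the sketch below. The kernel `scaleKernel`, the helper `processOnDevice`, and the array size are hypothetical names and values chosen for illustration, and `std::thread` is just one way to get the host threads. Newer CUDA releases also let a single host thread switch between devices with `cudaSetDevice`; the version here mirrors the approach described above.

```cuda
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of the slice it is given.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs on its own host thread: binds to one device, copies that device's
// slice over, launches the kernel, and copies the result back.
void processOnDevice(int dev, float* hostSlice, int n) {
    if (cudaSetDevice(dev) != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice(%d) failed\n", dev);
        return;
    }
    float* devSlice = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&devSlice, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed on device %d\n", dev);
        return;
    }
    cudaMemcpy(devSlice, hostSlice, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(devSlice, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(hostSlice, devSlice, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devSlice);
}

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }

    const int n = 1 << 20;  // arbitrary problem size for the sketch
    std::vector<float> host(n, 1.0f);

    // Split the array into one contiguous slice per device.
    int chunk = (n + deviceCount - 1) / deviceCount;
    std::vector<std::thread> workers;
    for (int dev = 0; dev < deviceCount; ++dev) {
        int offset = dev * chunk;
        int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        workers.emplace_back(processOnDevice, dev, host.data() + offset, count);
    }
    for (auto& t : workers) t.join();

    std::printf("first = %f, last = %f\n", host[0], host[n - 1]);
    return 0;
}
```

Note that each card only ever sees its own slice. Anything both cards need has to be copied to each of them explicitly, which is the cost the SLI-style "unified" view would have hidden.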
Paulius
P.S. The 8800 has 16 multiprocessors, each with 8 stream processors.