Using GTX 590 cards for CUDA: SLI cards under CUDA?

Each GPU in the 590 will be able to talk to the other. We’re not allowing different cards to talk to one another (no GF100 to GF104, for example).

I was wondering if anyone has tested 4.0rc2 on a GTX590 for the new P2P feature. The example given in the 4.0rc2 SDK will only run when at least two devices have “Tesla” in their device names (simpleP2P.cu:101). However, the “IsGPUCapableP2P” function (simpleP2P.cu:51) appears to check only the major SM version (>=2)…can anyone confirm that? Thanks!
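For anyone who wants to check this without editing the SDK sample, here is a minimal sketch that ignores device names entirely and just asks the runtime about each pair of GPUs. It is not the code from simpleP2P.cu; the helper name `isFermiOrNewer` is made up for illustration, and the only runtime calls used are `cudaGetDeviceCount`, `cudaGetDeviceProperties` and `cudaDeviceCanAccessPeer`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the SDK): true for compute capability >= 2.0,
// which mirrors the "major SM version >= 2" test described above.
static bool isFermiOrNewer(const cudaDeviceProp &prop) {
    return prop.major >= 2;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("Need at least two CUDA devices, found %d\n", count);
        return 0;
    }
    for (int a = 0; a < count; ++a) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, a);
        printf("GPU%d = \"%s\" (SM %d.%d) %s\n", a, prop.name, prop.major,
               prop.minor, isFermiOrNewer(prop) ? "SM check passed" : "SM too old");
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int canAccess = 0;
            // Ask the driver directly instead of filtering on the name string.
            cudaDeviceCanAccessPeer(&canAccess, a, b);
            printf("  peer access GPU%d -> GPU%d : %s\n", a, b,
                   canAccess ? "Yes" : "No");
        }
    }
    return 0;
}
```

If the runtime answers “No” for a pair, the name filter in the sample is not the only thing in the way.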

I just installed a GTX 590 (ASUS) and CUDA 4.0 RC2 on 64-bit Ubuntu Linux (10.04 LTS); note that a 64-bit OS is required for P2P. With a couple of tweaks to ‘simpleP2P.cu’ (which still had the “Tesla Board only” limitation held over from RC1), I was able to confirm that the GTX 590 supports P2P within the card at ~6GB/sec (full 16x PCIe 2.0 speed, as suspected).

I have yet to confirm the assumption that the bandwidth between the HOST and the two devices (on one card) is halved when copying data to/from both devices simultaneously. But it makes sense that this would be true.

[./simpleP2P] starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
GPU0 = “GeForce GTX 590” IS capable of Peer-to-Peer (P2P)
GPU1 = “GeForce GTX 590” IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access…
Peer access from GeForce GTX 590 (GPU0) -> GeForce GTX 590 (GPU1) : Yes
Peer access from GeForce GTX 590 (GPU1) -> GeForce GTX 590 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
GeForce GTX 590 (GPU0) supports UVA: Yes
GeForce GTX 590 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 6.17GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Enabling peer access…
Shutting down…
[./simpleP2P] test results…
PASSED
Press ENTER to exit…
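A bandwidth figure like the one in the log can be approximated with a few lines of timing code. The following is a rough sketch, not the SDK sample itself: the buffer size matches the 64MB in the log, but the repetition count and the helper `gbPerSec` are assumptions chosen for illustration, and error checking is mostly omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert bytes moved in `ms` milliseconds to GB/s.
static double gbPerSec(double bytes, float ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

int main() {
    int count = 0, can01 = 0, can10 = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("Need two CUDA devices\n");
        return 0;
    }
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("P2P not available between GPU0 and GPU1\n");
        return 0;
    }
    const size_t bytes = 64u << 20;  // 64MB, as in the log above
    const int reps = 20;             // assumed repetition count
    void *d0 = 0, *d1 = 0;
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1, bytes);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);  // GPU0 -> GPU1
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyPeer GPU0 -> GPU1: %.2f GB/s\n",
           gbPerSec((double)bytes * reps, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    return 0;
}
```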

We installed 2 GTX 590s in a Dell T7500, a 12-core machine with 60 GB of RAM.
Our main application would be to use the cards for GPU rendering in 3ds Max.
The cards are crazy fast, but put out a LOT of heat. They would need to be water cooled to be useful.
There is a lot of talk on the 3D forums that the memory limitation (2 GPUs, 1.5 GB each) is a real downer.

Do I understand correctly that the latest CUDA 4.0 RC2 just might… maybe… perhaps… get around this?
Thanks,
David

I don’t see how the 590 would differ in P2P transfer from having, e.g., two 580s. The only difference is that there is a PCIe switch on the GPU card instead of on the motherboard. This is all transparent according to the PCIe standard. What makes the 590 different from anything else?

An onboard PCI-e switch. Theoretically, P2P transfers can happen at the switch level without ever hitting the host PCI-e bus.

On the host side you may well have just another PCI-e switch…

In the case of 2 separate GPU cards you should be able to reach the same throughput; the PCIe switch is simply physically further away.

Yes, but the GTX590 won’t consume host PCI-e bandwidth doing P2P transfers between the GPUs on the same card.

Yes, that’s neat.

So theoretically, having 2 GTX 590s (4 GPUs) will allow you to use the full 6 GB of RAM?
David

CUDA devices are always independent. They are not merged into any kind of virtual “super-device.” With two GTX 590 cards, a CUDA application sees 4 separate GPUs, each with direct access to 1.5 GB of device memory. Thanks to unified virtual addressing in CUDA 4.0, you can pass a memory pointer from one device to another device, and inter-device reads and writes will be handled automatically through the PCI-Express bus, although the bandwidth will be significantly lower than when accessing the device memory connected directly to the GPU.

Does that answer your question?
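To make the pointer-passing point concrete, here is a minimal sketch in which a kernel running on GPU1 reads a buffer that lives in GPU0’s memory. The kernel name `scale`, the helper `blocksFor` and the buffer size are illustrative assumptions, and error handling is kept to a single check at the end.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: dst lives on the launching GPU, src may live on a peer.
__global__ void scale(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 2.0f * src[i];
}

// Hypothetical helper: number of blocks needed to cover n elements.
static int blocksFor(int n, int threads) {
    return (n + threads - 1) / threads;
}

int main() {
    int count = 0, canAccess = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("Need two CUDA devices\n");
        return 0;
    }
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);  // can GPU1 read GPU0?
    if (!canAccess) {
        printf("GPU1 cannot access GPU0 memory\n");
        return 0;
    }
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float *d0 = 0, *d1 = 0;

    cudaSetDevice(0);  // source buffer lives on GPU0
    cudaMalloc(&d0, n * sizeof(float));
    cudaMemcpy(d0, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaSetDevice(1);  // kernel and destination buffer live on GPU1
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1, n * sizeof(float));
    // d0 is a GPU0 pointer; with UVA and peer access enabled, GPU1 can
    // dereference it, and the reads travel over PCI-Express.
    scale<<<blocksFor(n, 256), 256>>>(d0, d1, n);
    cudaMemcpy(host.data(), d1, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaDeviceSynchronize();
    printf("host[0] = %.1f (%s)\n", host[0], cudaGetErrorString(err));

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```

Note that each GPU still only has its own 1.5 GB locally; the peer pointer just makes the other GPU’s memory reachable at PCIe speed.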

So if you have multiple GPUs on NF200 PCIe switches and you want to download the same data to all of them simultaneously, is it possible to do it by sending the data just once (like broadcast packets in Ethernet)?

You can send the data to one of the GPUs behind the switch and then transfer from that one to the other, while the host is already sending to the next card.
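A rough sketch of that relay idea, assuming two GPUs behind the same switch with peer access available: the host streams chunks to GPU0, and each chunk is forwarded to GPU1 as soon as it has arrived, while the next chunk is already on its way from the host. The chunk size, chunk count and the helper `chunkOffset` are arbitrary choices for illustration, and whether the overlap actually gains anything depends on the PCIe topology.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: byte offset of chunk k.
static size_t chunkOffset(int k, size_t chunk) {
    return (size_t)k * chunk;
}

int main() {
    int count = 0, canAccess = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("Need two CUDA devices\n");
        return 0;
    }
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("P2P not available\n");
        return 0;
    }
    const size_t chunk = 16u << 20;  // assumed 16MB chunks
    const int chunks = 4;            // assumed chunk count
    char *host = 0, *d0 = 0, *d1 = 0;
    cudaMallocHost(&host, chunk * chunks);  // pinned, needed for async copies
    memset(host, 1, chunk * chunks);
    cudaSetDevice(1);
    cudaMalloc(&d1, chunk * chunks);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0, chunk * chunks);

    cudaStream_t h2d, p2p;
    cudaStreamCreate(&h2d);
    cudaStreamCreate(&p2p);
    cudaEvent_t arrived[chunks];
    for (int k = 0; k < chunks; ++k) cudaEventCreate(&arrived[k]);

    for (int k = 0; k < chunks; ++k) {
        size_t off = chunkOffset(k, chunk);
        // Host -> GPU0 for chunk k.
        cudaMemcpyAsync(d0 + off, host + off, chunk, cudaMemcpyHostToDevice, h2d);
        cudaEventRecord(arrived[k], h2d);
        // GPU0 -> GPU1 starts once chunk k has landed; the host copy of
        // chunk k+1 can proceed in the other stream meanwhile.
        cudaStreamWaitEvent(p2p, arrived[k], 0);
        cudaMemcpyPeerAsync(d1 + off, 1, d0 + off, 0, chunk, p2p);
    }
    cudaError_t err = cudaStreamSynchronize(p2p);
    printf("relay finished: %s\n", cudaGetErrorString(err));

    for (int k = 0; k < chunks; ++k) cudaEventDestroy(arrived[k]);
    cudaStreamDestroy(h2d);
    cudaStreamDestroy(p2p);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    cudaFreeHost(host);
    return 0;
}
```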

Unless I’ve misunderstood you, for that to be of any benefit I’d need to have some other transfer I could be doing from the host to one of the GPUs behind the switch whilst the GPU to GPU transfer was going on. In fact, even then I’d potentially end up with the two transfers sharing PCIe bandwidth behind the switch. I was thinking of something more like this:

http://www.tomshardware.co.uk/3-way-sli-p55-nf200,review-32021.html

In my particular application I could imagine having a large number of GPUs behind just a single 16x PCIe bus, and with this broadcast feature the performance could be virtually as good as having each GPU on its own 16x PCIe bus. Of course, each GPU would generate different data, so readback would be more of a problem, but the results generally have to be stored somewhere (on disk, for example) and it’s pretty tough keeping up with a single 16x PCIe bus.

I agree, there shouldn’t be any potential gain there.

I ran the simpleP2P test but it said that my GTX590 does not support P2P. The OS is WinXP x64. Does anybody know how to enable P2P communication?

Sorry, I would like to ask people in this thread what host-to-device and device-to-host performance you get on the GTX590. I’m getting some strange bandwidth figures, as reported in this thread: http://forums.nvidia.com/index.php?showtopic=218954
I am still investigating the source of the trouble, because my PCIe slot is x16 v2.0.

It would be great if an application using a single card could automatically scale to 2 cards if present.

I know it’s possible if you code with multiple cards in mind.

But I wonder if something like “AFR” in graphics can be accomplished driver-side for CUDA.

In that way even if the application addresses a single card, it could be scaled to 2-3-4 cards seamlessly.
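For reference, coding “with multiple cards in mind” currently means something like the sketch below: the application enumerates the devices itself and hands each one a slice of the work. The kernel `addOne`, the helper `sliceLen` and the even split are illustrative assumptions; a real application would choose its own partitioning.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: add 1 to each element of this device's slice.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: elements given to device `dev` when n elements are
// split as evenly as possible over `count` devices.
static int sliceLen(int n, int count, int dev) {
    return n / count + (dev < n % count ? 1 : 0);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 1) {
        printf("No CUDA devices found\n");
        return 0;
    }
    const int n = 1 << 20;
    std::vector<float> host(n, 0.0f);
    std::vector<float *> dev(count, (float *)0);

    // Phase 1: give every GPU its slice and launch without waiting.
    int off = 0;
    for (int d = 0; d < count; ++d) {
        int len = sliceLen(n, count, d);
        cudaSetDevice(d);
        cudaMalloc(&dev[d], len * sizeof(float));
        cudaMemcpy(dev[d], host.data() + off, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        addOne<<<(len + 255) / 256, 256>>>(dev[d], len);  // asynchronous launch
        off += len;
    }
    // Phase 2: collect results; each copy waits for that GPU's kernel.
    off = 0;
    for (int d = 0; d < count; ++d) {
        int len = sliceLen(n, count, d);
        cudaSetDevice(d);
        cudaMemcpy(host.data() + off, dev[d], len * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev[d]);
        off += len;
    }
    printf("%d device(s), host[0] = %.1f, host[n-1] = %.1f\n",
           count, host[0], host[n - 1]);
    return 0;
}
```

The same binary runs on 1, 2 or 4 GPUs, but the scaling comes from the application, not from the driver.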