How will 9800 GX2 appear to CUDA?

(I know Nvidia employees generally can’t comment on future products, but I figured I’d post this topic and hope for a reply whenever they are allowed to talk about it.)

We’re seeing reports on the web now of a 9800 GX2 card which sounds like two cards bolted together in one package, each half with 128 stream processors and 512 MB of memory, for a total of 256 processors and 1GB. Will a 9800 GX2 card appear as one CUDA device or two? I’m expecting the answer is “two,” but just wanted to check…

I am also looking forward to hearing the answer. Is there anybody from Nvidia who can clarify this topic?

Best Regards,

From what I have read on various sites, it uses SLI, so just as with normal SLI you will probably see 2 CUDA-capable devices.

Also interested in this card. I wish it would give a performance gain for CUDA routines.

Just as with other SLI configurations, if SLI is enabled, CUDA sees only one of the 2 SLI devices. Simply disable SLI in the NVIDIA Control Panel and CUDA will see 2 GPUs. With the new 9800 GX2, the driver defaults to SLI mode when it is installed.
This means deviceQuery will show 1 device.
To see both for CUDA:
Open the NVIDIA Control Panel and change 3D Settings -> Set multi-GPU configuration to Multi-display mode (requires a reboot).

Thanks for the answer! I had almost forgotten about the question. :)

sometimes we just can’t answer a question - such as about a future card - until it’s announced :D

Does this also mean that each GPU has its own 512 MB of onboard memory?

I hope it is possible to share the 1 GB of onboard memory between the 2 GPUs, just like dual CPUs share host memory.

No, the memory is per GPU. The card is 2 PCBs joined by an SLI connector, and 1 of the PCBs carries the PCIe interface.

I have another question, I hope an NVIDIA person can answer (here or with pm)

Will we in the near future (before 2009) get boards like the 9800GX2 that have bandwidth comparable to the 8800GTX? I have a program that probably needs 4 GPUs to run in real time, and the form factor of the S870 is incompatible with our needs. So a single PC with 2x 9800GX2 would do the trick, BUT all my kernels are bandwidth bound, so the narrower memory bus (despite the higher memory clock) of the 9800GX2 will likely prevent me from using this as a solution.

Now, I still have some time to wait for new products, and I have read something about a G200 at the end of the year. That is surely something you cannot comment on, but hopefully you can answer the question above.

Conjecturing here: NVIDIA may be sticking with 256-bit memory buses for the near future due to the expense of a 384-bit bus. If that is the case, then you will see bandwidth parity between the 8800 GTX and some hypothetical future 256-bit card when the memory clock reaches somewhere around 2200 MHz. This is the leaked clock rate of the upcoming 9800 GTX, which might explain why it gets the GTX designation. (So you would have to wait until this faster memory clock gets pushed onto a future X2 board…)

My understanding is that the leaked 2200 MHz clock rate of the 9800GTX refers to the base memory clock times 2, with the factor of 2 coming from DDR. The actual clock rate would be 1100 MHz compared to the 900 MHz of the 8800GTX. That is better, but it won’t compensate for the 256-bit bus versus the 384-bit bus of the 8800 GTX.
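
To put numbers on this, here is a rough sketch of the arithmetic in Python. It assumes the usual peak-bandwidth formula (bus width in bytes times transfers per second, with 2 transfers per clock for DDR) and the figures quoted in this thread: 384-bit at 900 MHz for the 8800 GTX, and 256-bit at a rumored 1100 MHz base clock. The function and variable names are just for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, base_clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 models DDR memory (assumption stated in the thread).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Figures as quoted in the thread; the 1100 MHz value is a leaked rumor.
gtx_8800 = peak_bandwidth_gb_s(384, 900)
rumored_256_bit = peak_bandwidth_gb_s(256, 1100)

# Base clock a 256-bit bus would need to match 384-bit at 900 MHz.
parity_base_clock_mhz = 900 * 384 / 256

if __name__ == "__main__":
    print(f"8800 GTX (384-bit, 900 MHz):   {gtx_8800:.1f} GB/s")
    print(f"Rumored 256-bit at 1100 MHz:   {rumored_256_bit:.1f} GB/s")
    print(f"Base clock needed for parity:  {parity_base_clock_mhz:.0f} MHz")
```

Under these assumptions the 8800 GTX comes out at 86.4 GB/s and the rumored 256-bit card at 70.4 GB/s. Parity on a 256-bit bus would need a base clock of 1350 MHz (2700 MHz effective), not 2200 MHz.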

384-bit busses are for wimps. 512-bit for the win!

Seriously, I second the request for more memory bandwidth (much of my app is memory bound, too). But Tom’s Hardware did benchmark a slight disadvantage for the 9800 at extremely high resolutions with AA enabled, which is probably because of the lower memory bandwidth (or possibly the lower amount of total memory) compared to the Ultra. So I also understand that the kind of memory bandwidth I would like probably isn’t economical for the “graphics” side of the card, since it isn’t really needed at the highest resolutions with AA.

Anyway, given NVIDIA’s policy on discussing future products, we will only find out when the next-gen chips are released. …800_GX2/?page=8

Thanks to rockypg

Are there any plans to allow CUDA to work if SLI is enabled? I can imagine a driver that would let CUDA and SLI be more dynamic. If the graphics can be rendered on a single GPU, SLI could be soft-disabled. During that time, more processing power would be available for CUDA. Then, if the graphics workload required the second GPU, CUDA could be scaled back to a single GPU again.

Disabling SLI made sense last year, but now that both SLI and CUDA are becoming more mainstream, perhaps this engineering decision should be revisited?

Along the SLI thinking:

It would be wonderful if there was a way for CUDA to divide kernel blocks between multiple cards without requiring the application to spawn CPU threads and manage it explicitly. Certainly, it would not work for all kernels, but for many cases, all you need is for cudaMalloc and cudaMemcpy-like functions in the host-to-device direction to transparently mirror to all cards. The semantics of global memory writes in this card-mirrored world are trickier to define, but not impossible. If you exclude atomics, multiple blocks writing to the same location is generally a bug, and without block synchronization, blocks should not be reading each other’s writes in the same kernel call. Then there is the problem of merging global array modifications between kernel calls, and what happens when you memcpy in the device-to-host direction.

Anyway, I’m getting off-topic, but the limitations of block communication actually make it possible to implement some kind of SLI-for-CUDA. (The 80/20 rule applies here, of course. Explicit management should always be an option.)
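
To make the idea a bit more concrete, here is a toy host-side simulation in Python. It is not CUDA and not anything the driver does. It only models the scheme described above under one big assumption: each global array element is written by at most one block, so merging the per-card results is trivial. All names (`split_blocks`, `run_mirrored`, the `kernel` callback) are made up for illustration.

```python
def split_blocks(num_blocks, num_devices):
    """Divide a 1-D grid into contiguous, near-equal (first_block, count) ranges."""
    base, extra = divmod(num_blocks, num_devices)
    ranges = []
    start = 0
    for dev in range(num_devices):
        count = base + (1 if dev < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


def run_mirrored(kernel, data, num_blocks, block_size, num_devices):
    """Simulate one kernel launch spread over several cards.

    kernel(inputs, i) returns the value that thread i writes to output[i].
    Assumption: it never returns None and no two threads write the same index.
    """
    per_device_outputs = []
    for first, count in split_blocks(num_blocks, num_devices):
        mirrored_input = list(data)      # host-to-device copy mirrored to each card
        out = [None] * len(data)         # this card's private copy of the output
        for block in range(first, first + count):
            for thread in range(block_size):
                i = block * block_size + thread
                if i < len(data):        # usual bounds check for a partial last block
                    out[i] = kernel(mirrored_input, i)
        per_device_outputs.append(out)

    # Merge step: take each element from whichever card wrote it.
    merged = [None] * len(data)
    for out in per_device_outputs:
        for i, value in enumerate(out):
            if value is not None:
                merged[i] = value
    return merged


if __name__ == "__main__":
    print(split_blocks(10, 4))
    print(run_mirrored(lambda d, i: d[i] * 2, [1, 2, 3, 4, 5], 3, 2, 2))
```

The merge loop is where the hard part hides. As soon as two blocks on different cards write the same location, or a block reads another block's writes, this simple "take whoever wrote it" rule no longer has well-defined semantics.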

Well, let’s start with device-to-device copies between cards over SLI. That would already help a lot with dividing an application across 2 GPUs, since the overhead of copying from 1 GPU to the other would be much, much less.