CUDA and OpenCL support for multiple GPU/PCI devices?


I’m about to purchase a PCI video card so that I can use more displays on my PC. I’m debating between the NVIDIA 5200 PCI card and the NVIDIA 8400s PCI card. The 5200-based card is about $33 and the 8400s is about $65 (both dual head), so I could almost buy two 5200 cards for the price of one 8400 PCI card (or take $30 off another LCD display). My PCIe slot is already in use, so I need to go with PCI. Further down the road, I will purchase a new system with a PCIe x16 9000 series card, and add in my PCI video cards for multi-monitor support. Maybe I sound cheap, but I will likely purchase many of these cards for multiple systems, so the cost could add up.

I have multiple questions:
Has anyone benchmarked CUDA apps running on the 8400s PCI video card against an 8400s PCIe video card? If no benchmark data is available, does anyone have any experience running CUDA on a PCI card which would help me make a more informed decision?
Obviously PCI is a huge bottleneck, but I wonder how important that is when the card comes stocked with 256 MB.
This will decide whether I purchase the 8400s or the 5200.

Can multiple CUDA cards be combined and utilized by the CUDA and/or OpenCL API? In other words, when I purchase a new PCIe CUDA card, will I be able to use the CUDA functionality in both the new card and the PCI card I will be purchasing soon?

If they can’t be used within the same app, can one CUDA card be used in one application while another card is used in a second CUDA application?

Is SLI required to utilize CUDA from multiple CUDA devices?



If you have multiple devices in the system, you actually need to turn SLI off to use CUDA with them. Yes, you can use multiple devices in one system. When you get working with CUDA, you’ll find a few sample projects in the SDK that demonstrate how to program this situation.

I can’t vouch for everyone here, but I think most people would warn you away from using a PCI card with CUDA unless you just want to learn with it. The bandwidth to and from the card is much too low for serious simulations or calculations, and will present too much of a bottleneck.

Also, check the CUDA website to make sure that whatever cards you buy are actually compatible with CUDA. It would stink to buy something (even if it was inexpensive) and then not have it work for what you want.

I just purchased the “Sparkle GeForce 9400 GT Video Card - 1GB DDR2, PCI, DVI, VGA” for my P4 3 GHz system. Yes, I’m just getting it to learn CUDA/OpenCL now and get my feet wet. It was only $89.99 at TigerDirect, which seems like a good deal for a dual head 1 GB card regardless. You can never have too many displays IMO.

Folks are saying the PCI bus doesn’t have enough bandwidth for CUDA, but I haven’t found any benchmarks that confirm this. If someone has any benchmarks comparing CUDA performance on PCI vs PCI-E, please post them. There are no AGP CUDA cards.

The card has a ton of memory (I’m using a 2 MB PCI card at work for my second display :) ). So if I write the application properly (neural net pattern recognition), I think I should get a nice performance increase. That is, I fill the video RAM with my data to be processed by the GPU on initialization, process the hell out of it, then write it out to disk.

Then again, the CPU is managing the CUDA threads, so the extra latency of the PCI bus will be unavoidable in that case.
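
As a back-of-the-envelope check on the load-once idea above, here is a small Python sketch (my own illustration, not anything from the CUDA SDK). It estimates what share of a run would go to the initial upload if the data crosses the bus only once. The 133 MB/s figure is the theoretical peak of 32-bit/33 MHz PCI; the compute times and the function name are purely hypothetical.

```python
# Share of total run time spent uploading data when it is copied to the card once.
# Assumptions: a single upload, no further bus traffic, bandwidth sustained at the given rate.

PCI_PEAK_BYTES_PER_S = 133e6   # theoretical peak of 32-bit/33 MHz PCI
DATA_BYTES = 1024**3           # fill the card's 1 GB of memory once

def transfer_fraction(data_bytes, bandwidth_bytes_per_s, compute_seconds):
    """Fraction of (upload + compute) time that is spent on the upload."""
    upload_seconds = data_bytes / bandwidth_bytes_per_s
    return upload_seconds / (upload_seconds + compute_seconds)

# Hypothetical compute times: one minute, ten minutes, one hour.
for compute_seconds in (60, 600, 3600):
    share = transfer_fraction(DATA_BYTES, PCI_PEAK_BYTES_PER_S, compute_seconds)
    print(f"compute {compute_seconds:5d} s -> upload is {share:6.2%} of the run")
```

The longer the GPU can work on data that is already resident, the less the slow bus matters; a kernel that has to stream data in and out continuously would not get that benefit.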

It would be great if the PCI card and the PCI-E card could work in tandem. You only get so many PCI-E slots in a PC, and the latency and lower bandwidth of the PCI bus are trivial compared to what you would have when clustering CUDA machines over a LAN to make a Beowulf cluster.

I’ll run Folding@home when I get the PCI card installed next week and update this thread with the benchmark results.

If you want to check the bandwidth, there is a sample project included in the CUDA SDK called “bandwidthTest”. Compile that, run it, and see what kind of results you get. If you do, post your results here so that others can find the information here later if they need it :)

For reference, my development machine has a 2.4 GHz Core 2 Duo, 4 GB DDR2-800 RAM, and a PCIe x16 512 MB 8800 GT, and I get the following results:

Running on......
      device 0:GeForce 8800 GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1663.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1324.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               39400.6

&&&& Test PASSED

Here is the only benchmark I’ve seen in the forum:…st&p=509966

90 MB/sec host->device and 110 MB/sec device->host.
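
To put the figures quoted in this thread side by side, here is a rough Python sketch (not part of the SDK) that turns them into host-to-device copy times for a 32 MiB buffer, the transfer size bandwidthTest reports, and for a full 1 GiB data set. The function name and the sizes are my own choices, the "MB" unit is assumed to mean 10^6 bytes, and latency and per-call overhead are ignored.

```python
# Rough estimate of host->device copy time at a sustained bandwidth.
# Bandwidth figures are the ones quoted in this thread; everything else is illustrative.

MB = 1_000_000  # assumption: MB/s means 10^6 bytes per second

def transfer_seconds(size_bytes, bandwidth_mb_per_s):
    """Copy time in seconds, ignoring latency and per-call overhead."""
    return size_bytes / (bandwidth_mb_per_s * MB)

BANDWIDTHS = {
    "PCI, measured host->device": 90.0,
    "PCI, theoretical peak": 133.0,
    "PCIe x16 8800 GT, measured host->device": 1663.9,
}

SIZES = {"32 MiB": 32 * 1024**2, "1 GiB": 1024**3}

for label, bandwidth in BANDWIDTHS.items():
    for size_label, size_bytes in SIZES.items():
        seconds = transfer_seconds(size_bytes, bandwidth)
        print(f"{label:42s} {size_label:>7s}: {seconds:8.2f} s")
```

At 90 MB/s a full 1 GiB upload takes roughly 12 seconds, against well under a second over PCIe x16, so how much the bus hurts depends on how often the data has to cross it.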

Thanks for the info on the bandwidth test. I’ve downloaded the SDK and will run and post benchmarks later when I have a chance. It would be great if someone could point me to CUDA applications that don’t just test bandwidth but actual CUDA performance. I’m curious to see how my PCI 9400 GT measures up against a PCIe x16 9400 GT.

I finally installed the 9400 GT card. Unfortunately I only have a single PCI slot in my Shuttle case (P4 3.0 GHz w/HT, 2 GB RAM, Win XP) and the heat sink on the card was so thick I had to remove my AGP 5200 card. Maybe I can replace the heat sink with a smaller one so I can use the AGP card for video and the PCI card just for CUDA.

I installed the latest drivers and the Folding@home GPU client. I seem to be getting pretty good performance with Folding@home. I’m doing about 1-2 work units/day and getting extra bonus points. You can find my results on the Folding@home site under the username matt7273. So far I’m getting about 350 points per work unit.

CPU usage stays very low (under 5%) almost all of the time it is folding, though when I open the visualization CPU usage goes way up. One time when I launched the visualization it blue screened! There was an easy workaround for that: don’t run the visualizer! Now it’s running smoothly. However, I can tell PCI bandwidth is being maxed out because window redraws are noticeably slower, likely a problem with GDI. Perhaps if I had Vista (which I believe uses DirectX for the display instead of GDI) I wouldn’t have this problem.

When I ran the visualizer I noticed it would compute about 0.1% of the folding, and then pause for about a minute before continuing. This may be the PCI bandwidth limit showing itself as the results are sent from the card back to the CPU/disk.

I ran the Quake 4 demo to see how game performance was, and it was definitely faster than my older 5200 AGP (4x?) card. This card definitely breathes new life into an older system, and runs noticeably cooler. The main problem of course is that it will take up PCI bandwidth which may be needed by other devices (I think the IDE interface connects over PCI, for instance).

It’s too bad there aren’t any AGP cards that have CUDA support. There are ATI AGP cards that support ‘Close to Metal’ and OpenCL, but my experience with ATI drivers in the past has led me to only use NVIDIA. Also it seems like NVIDIA is more serious about non-gaming uses for their cards than ATI, as ATI does not have an equivalent of the Tesla card (yet?).

It’s nice to know that when I get a new PC in the next month I’ll be able to reuse this card and develop with CUDA.

Try running the nbody program from the SDK in release mode. It will be in the ‘bin’ folder in the SDK (then Win32 or win64, then ‘Release’) as nbody.exe. You can run it from the command prompt like:

nbody -benchmark -n=2048

And see what your results are. You can change the n=2048 to other powers of two (1024, 4096, 8192, etc.) to play around with it. For the 9400GT, the default size is n=2048 (like I wrote above). Since the program runs n^2 computations, each time you move up the number of particles by 2x (e.g. from 2048 to 4096), the compute time will increase by about 4x.
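
To make the scaling concrete, here is a short Python sketch of the arithmetic (only the n^2 relationship described above, not the nbody sample itself). The function names are made up for illustration, and real timings will deviate once overheads or memory limits come into play.

```python
# Relative cost of an all-pairs n-body step, assuming work grows as n * n.

def interactions(n):
    """Body-body interactions per step in an all-pairs scheme."""
    return n * n

def relative_cost(n, baseline=2048):
    """Expected compute time relative to the baseline particle count."""
    return interactions(n) / interactions(baseline)

for n in (1024, 2048, 4096, 8192):
    print(f"n={n:5d}  interactions={interactions(n):>10,d}  "
          f"relative cost={relative_cost(n):5.2f}x")
```

So going from the default 2048 bodies to 8192 should take about 16 times as long per step if the n^2 estimate holds.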

That will provide you with a good “computing speed” benchmark versus any of the other cards…

Thanks. I see there are new CUDA downloads; I will try to get to running the nbody program this week.

I still haven’t run pure benchmarks, or had time to read up on or play with CUDA (I’ve been very busy with a WebDAV project instead. Gotta pay the bills.)

However, I recently purchased a laptop with the 9300 GPU, which has half as many cores (8) as my PCI 9400 card (16).

I’ve been running Folding@home 24/7 on both systems since the purchase, and I’m finding that with this particular application the 16-core PCI card is completing work units twice as fast as the 8-core laptop. Obviously protein folding by its very nature is unpredictable as a short-term benchmark, but I think a full week is enough to come to this tentative conclusion: with this particular application (the Folding@home GPU build) the lower PCI bandwidth of 133 MB/sec is not a significant factor. CPU usage on both systems remains close to 0 throughout the folding process, so I don’t see that as a factor in this case. The only caveat is that my PCI card has 1 GB and the laptop GPU has 512 MB, which is also half. I seriously doubt Folding@home is using much VRAM though.

However, I’m not going to recommend the 9400 PCI solution unless you have a very limited budget. The card costs about twice as much as the PCIe version, and only has 16 cores. It seems CUDA performance scales pretty closely with the number of cores, and you would be better advised to buy a card like the 9500 PCIe card with 32 cores, and a Dell Inspiron from Dell Outlet for $189. That’s the next supercomputer for me, I think.