cutil cuda sdk utility

I tried cudaGetDeviceCount recently. The comments in the demo file multigpu.cu say:
// Note that in order to detect multiple GPUs in your system you have to disable
// SLI in the nvidia control panel. Otherwise only one GPU is visible to the
// application. On the other side, you can still extend your desktop to screens
// attached to both GPUs.
I have a Windows XP machine with an NVIDIA 8500 and a Tesla C870. If I disable SLI, 2 GPUs are reported and the processing time is slow (2000 ms). If I do not disable SLI, 1 GPU is reported and the processing time is fast (30 ms).
Since I have a C870, should 128 GPUs be reported?
Why do 2 GPUs process more slowly than 1 GPU?

Thanks for help!

I guess that with SLI disabled, CUDA uses the 8500, which is the ‘first’ device. It’s a slow card, so 200 ms is okay for it.

When you turn SLI on, CUDA no longer detects the 8500 as a valid device and switches to the Tesla, which is fast.

No, you shouldn’t get 128 GPUs; you should get as many as are installed in your system, and your system has 2. 128 is the number of stream processors on board the Tesla; this kind of information is available through other SDK functions.

And it’s not that 2 GPUs are slower; it is the 8500 that is slower than the Tesla. Those SDK samples do not demonstrate simultaneous calculations on several CUDA devices.

Thanks a lot for the help! So this really is the number of CUDA devices, not the number of stream processors. For the application, we should always enable SLI and use as many Teslas as possible.

No, you do not need to use SLI. You can query the CUDA-capable devices in your computer and use only the fastest one.
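Here is a minimal sketch of that query, in case it helps. It uses cudaGetDeviceCount and cudaGetDeviceProperties from the CUDA runtime and ranks devices by their multiprocessor count. The function name pickFastestDevice and the ranking rule are my own assumptions for illustration; multiprocessor count is only a rough stand-in for speed, and you may prefer to rank on something else.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the SDK): returns the index of the
// CUDA device with the most multiprocessors, or -1 if none is usable.
// Assumption: more multiprocessors means a faster card, which is only
// a rough heuristic.
int pickFastestDevice() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return -1;
    }

    int best = -1;
    int bestMultiprocessors = -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices that cannot be queried
        }
        // Report what the runtime sees for each device.
        printf("Device %d: %s, compute %d.%d, %d multiprocessors\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount);
        if (prop.multiProcessorCount > bestMultiprocessors) {
            bestMultiprocessors = prop.multiProcessorCount;
            best = dev;
        }
    }
    return best;
}
```

You would call cudaSetDevice with the returned index (after checking it is not -1) before allocating memory or launching kernels. Note that the count reported per device is multiprocessors, not stream processors; as far as I know the C870 reports 16 multiprocessors with 8 stream processors each, which is where the 128 comes from.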

I ran a test with cublasStrsv and compared the result with strsv.c running on the CPU. To my surprise, with the same data set, for matrices up to 1000x1000, the CPU is faster than the GPU. Only about the first 10 values of the GPU solution are right; all the others diverge. But the CPU solutions are all right. What’s wrong with cublasStrsv?

1000x1000 may not be big enough for the GPU overhead to be cancelled out. Also, the GPU works in floats, while the CPU does extended-precision floating-point calculation, so results may vary a bit (or a lot, if your algorithm is susceptible to small variances).
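One way to check whether precision alone explains the divergence is to run the same triangular solve on the host in both float and double and compare. The sketch below is host-only code (it should compile with nvcc or any C++ compiler) and is not how cublasStrsv is implemented; solveLower and maxPrecisionGap are hypothetical names, and it assumes a lower-triangular matrix stored row-major.

```cuda
#include <cmath>
#include <vector>

// Forward substitution for a lower-triangular system L x = b.
// L is an n*n row-major array (an assumption of this sketch);
// T is the working precision used for all arithmetic.
template <typename T>
std::vector<T> solveLower(const std::vector<double> &L,
                          const std::vector<double> &b, int n) {
    std::vector<T> x(n);
    for (int i = 0; i < n; ++i) {
        T sum = static_cast<T>(b[i]);
        for (int j = 0; j < i; ++j) {
            sum -= static_cast<T>(L[i * n + j]) * x[j];
        }
        x[i] = sum / static_cast<T>(L[i * n + i]);
    }
    return x;
}

// Largest absolute difference between the float and double solutions.
double maxPrecisionGap(const std::vector<double> &L,
                       const std::vector<double> &b, int n) {
    std::vector<float> xf = solveLower<float>(L, b, n);
    std::vector<double> xd = solveLower<double>(L, b, n);
    double gap = 0.0;
    for (int i = 0; i < n; ++i) {
        gap = std::fmax(gap, std::fabs(static_cast<double>(xf[i]) - xd[i]));
    }
    return gap;
}
```

If the gap stays tiny on your data while the GPU result is still wrong after the first few entries, rounding is probably not the whole story and the call arguments are worth rechecking.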