Is this possible? Understanding CUDA

I have a question. I have an SLI mobo and got an 8800 GTS 512 (G92). Now I was wondering: can I put an 8600 GT beside it just for the CUDA calcs and bridge it to the 8800 GTS? Or do I just use my 8800 GTS and install the CUDA driver (Vista x64)?

The G92 8800 GTS will be much faster at CUDA than an 8600GT, so why bother? You can run CUDA and display on the same GPU without any problems.

If you add a 2nd card, you don’t need the SLI bridge to use CUDA on both. In fact, if you enable SLI in the drivers, CUDA can only use one of the cards.
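To illustrate the point about a second, non-SLI'd card: a CUDA program picks its GPU explicitly through the runtime API. Here is a minimal sketch that lists the installed devices and binds the calling host thread to device 1 (e.g. the non-display card); the device index is an assumption, since enumeration order varies by system.

```cuda
// Minimal sketch: enumerate CUDA devices and bind this host thread to
// the second card (device 1), e.g. to keep CUDA work off the display GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    if (count > 1)
        cudaSetDevice(1);   // all later CUDA calls in this thread use card 1
    return 0;
}
```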

Isn’t there a 5 second limitation for CUDA on a GPU that’s running a display under Windows?

What I mean is: could the 6800GT card be used only for the CUDA calcs (like a physics card)?


Only G80 and later do CUDA.

The 5 second watchdog rarely comes into play. Most kernels complete in milliseconds, although this is application dependent of course.

“rarely” seems a bit extreme. Some of my kernels may not complete for several minutes. The 5-second rule is a critical limitation to be aware of, IMHO.
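For kernels that would otherwise run for minutes, one common way to live with the watchdog is to split the work into many short launches, keeping state in device memory between them. This is only a hedged sketch with made-up names (DoChunk, RunLong); the real work and chunk size depend on the application.

```cuda
// Hedged sketch: stay under the display watchdog by splitting a long
// computation into many short launches; state persists in device memory.
#include <cuda_runtime.h>

__global__ void DoChunk(float *state, int first_step, int steps_per_launch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int s = 0; s < steps_per_launch; ++s)
        state[i] += 0.001f * (first_step + s);   // stand-in for real work
}

void RunLong(float *d_state, int n, int total_steps, int steps_per_launch)
{
    int blocks = (n + 255) / 256;
    for (int first = 0; first < total_steps; first += steps_per_launch) {
        DoChunk<<<blocks, 256>>>(d_state, first, steps_per_launch);
        cudaDeviceSynchronize();   // each launch stays well under 5 s
    }
}
```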

An interesting question: what is the overhead of calling a kernel in a loop versus having the kernel loop on its own?

for (int i = 0; i < 1000; i++) {
    Kernel<<<grid, block>>>(values);
}

versus

__global__ void Kernel(int *values)
{
    for (int i = 0; i < 1000; i++) { ... }
}


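To make the memory-traffic side of that trade-off concrete, here is a hedged sketch contrasting the two styles; StepOnce and StepMany are hypothetical names, and the arithmetic is just a stand-in for real per-iteration work.

```cuda
// Illustrative only: compare per-iteration global-memory traffic.
#include <cuda_runtime.h>

// Kernel-per-iteration style: every launch reads and writes global memory.
__global__ void StepOnce(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 0.5f + 1.0f;        // one load + one store per launch
}

// Loop-inside-kernel style: the value stays in a register across iterations.
__global__ void StepMany(float *data, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];                      // one load
    for (int s = 0; s < steps; ++s)
        v = v * 0.5f + 1.0f;                // register-only work
    data[i] = v;                            // one store
}
```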
It depends a bit on whether you are crossing a register-usage boundary that makes your occupancy go down. A kernel call has some overhead, so normally it is wise to loop inside your kernel.

As I said, it is application dependent :) My own applications involve calling short millisecond kernels millions of times, so I am biased the other way. Still, any memory-bound kernel can read/write device memory hundreds of times in 5 s. My experience on these forums is that most cases seem to be memory bound, hence my “rarely” comment.

If you can loop inside your kernel, it will result in a speedup over making many kernel calls. The kernel call overhead is something like ~20 microseconds per call, maybe a little more. But more importantly, your kernel won’t need to dump to global memory after every iteration, saving you lots of global memory transfers, which should boost performance significantly.
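A rough way to check the per-call overhead on your own card is to time many launches of an empty kernel with CUDA events and divide; this is a sketch, and the measured figure will vary with driver, OS, and hardware.

```cuda
// Rough sketch: measure kernel-launch overhead by timing many launches
// of an empty kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}

int main()
{
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
    printf("~%.1f microseconds per launch\n", 1000.0f * ms / launches);
    return 0;
}
```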

I agree, and just to chime in -

If my feel for things is right, there’s not only the latency to consider, but also the utilisation of the multiprocessors. Exaggerated example: if you run 15 blocks 100 times on a 16-multiprocessor card, I think you stand a chance of “wasting” one multiprocessor 100 times. It’s not cut-and-dried, but I think I saw this behavior in my app.