Architecture Questions


I have a GeForce 8600 GTS and I am using CUDA. I have run the deviceQuery example and it says the following:

Max threads per block: 512
Max dimension of a block: 512 x 512 x 64
Max dimension of a grid: 65535 x 65535 x 1

These are the maximum values, and I have read that any configuration larger than the available hardware will have its execution pipelined. What I would like to know is: what are the dimensions of my board without any pipelining? I want to max out the board's parallelism without pipelining the execution, so that I can get theoretical maximum performance values.

Where can I read more about the architecture? For instance, it says on the website that my board has 12 multiprocessors. How many processors are inside each multiprocessor and how many hardware threads can each processor run?

“CUDA Programming Guide” is an excellent start.

You find it at

  • Kuisma

Okay, yes, that is a good reference. I have read it before, but not carefully enough. I have a lot to learn. :)

It says:
Each multiprocessor is composed of 8 processors so that a multiprocessor can process 32 threads of a warp in 4 clock cycles.

I know this is probably naive, but could I use a configuration with 8 threads per multiprocessor to process 8 threads in 1 clock cycle?


Have a look at appendix A - it contains all the figures you’ll need.

Technically yes, you can run 8 threads per multiprocessor. Launch as many thread blocks as there are multiprocessors, 8 threads per block.

But the real answer to your question, as posted before, is no, because threads are run 32 at a time (one warp) and each warp takes 4 cycles per instruction. So your 8 threads would take 4 cycles per instruction too.

In general you should try to run at least a few warps per thread block (>96 threads) to help hide latency.
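The warp arithmetic in this reply can be sketched in a few lines of host-side C++. This is only a simplified issue model built from the two figures quoted in the thread (32 threads per warp, 8 processors per multiprocessor); the function names are hypothetical, and memory and register latencies are deliberately ignored.

```cpp
// Figures quoted in this thread from the CUDA Programming Guide.
constexpr int kWarpSize   = 32;  // threads per warp
constexpr int kCoresPerSM = 8;   // processors per multiprocessor

// Number of warps needed to hold `threads` threads (rounded up).
// A partially filled warp still occupies a whole warp.
constexpr int warpsFor(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Cycles one multiprocessor spends issuing a single instruction for
// every warp of a block with `threads` threads (simplified model).
constexpr int cyclesPerInstruction(int threads) {
    return warpsFor(threads) * (kWarpSize / kCoresPerSM);
}

// Fraction of the issued warp lanes that belong to real threads.
// `threads` must be greater than zero.
constexpr double laneUtilization(int threads) {
    return static_cast<double>(threads) / (warpsFor(threads) * kWarpSize);
}
```

Under this model, `cyclesPerInstruction(8)` and `cyclesPerInstruction(32)` both evaluate to 4, while `laneUtilization(8)` is 0.25: an 8-thread block pays for a full warp but uses only a quarter of it.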



Yeah, I understand your hatred towards pipelining. But what usually happens is that you need more active threads per multiprocessor than there are processors, in order to hide latencies.

To hide register latencies (such as register read-after-write hazards, pipeline hazards and so on) and to hide memory-access latencies (global memory is slow), you need at least 192 threads per multiprocessor.

These 192 threads could be 1 block x 192 threads per block, or 6 blocks x 32 threads per block, and so on. The configuration is up to your application.
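As a rough illustration of the "and so on", the following host-side C++ sketch lists every way to split a per-multiprocessor thread target into equal blocks. The function name `blockConfigs` is hypothetical, and restricting block sizes to whole multiples of the warp size is an added assumption, made so that no warp is left partially filled. Other per-multiprocessor limits (registers, shared memory, resident blocks) are not modeled.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Warp size quoted in this thread.
constexpr int kWarpSize = 32;

// Returns every (blocksPerMultiprocessor, threadsPerBlock) pair whose
// product is exactly `targetThreads`, where the block size is a whole
// number of warps and does not exceed `maxThreadsPerBlock`.
std::vector<std::pair<int, int>> blockConfigs(int targetThreads,
                                              int maxThreadsPerBlock) {
    assert(targetThreads > 0 && maxThreadsPerBlock > 0);
    std::vector<std::pair<int, int>> configs;
    for (int threadsPerBlock = kWarpSize;
         threadsPerBlock <= maxThreadsPerBlock &&
         threadsPerBlock <= targetThreads;
         threadsPerBlock += kWarpSize) {
        // Keep only block sizes that divide the target evenly.
        if (targetThreads % threadsPerBlock == 0) {
            configs.emplace_back(targetThreads / threadsPerBlock,
                                 threadsPerBlock);
        }
    }
    return configs;
}
```

Calling `blockConfigs(192, 512)`, with 512 being the "Max threads per block" value reported by deviceQuery above, yields four pairs: 6 x 32, 3 x 64, 2 x 96 and 1 x 192.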