Architecture Questions

philprattszeliga · February 2, 2008, 7:49pm

Hello,

I have a GeForce 8600 GTS and I am using CUDA. I have run the deviceQuery example and it says the following:

Max threads per block: 512
Max dimension of a block: 512 x 512 x 64
Max dimension of a grid: 65535 x 65535 x 1
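
A note on reading these limits: the three block dimensions are per-axis maxima, while "Max threads per block" caps the product x * y * z, so a 512 x 512 x 64 block is not itself launchable. The following is a minimal C++ sketch of that check, using only the numbers reported above. `Dim3`, `block_fits`, and `grid_fits` are hypothetical names for illustration, not part of the CUDA API.

```cpp
// Limits copied from the deviceQuery output quoted above (GeForce 8600 GTS).
// Dim3 is a hypothetical stand-in for CUDA's dim3, used for illustration only.
struct Dim3 { unsigned x, y, z; };

constexpr unsigned kMaxThreadsPerBlock = 512;
constexpr Dim3 kMaxBlockDim = {512, 512, 64};
constexpr Dim3 kMaxGridDim  = {65535, 65535, 1};

// A block is valid only if every axis is within its own limit AND the
// total thread count (x * y * z) does not exceed kMaxThreadsPerBlock.
constexpr bool block_fits(Dim3 b) {
    return b.x >= 1 && b.y >= 1 && b.z >= 1 &&
           b.x <= kMaxBlockDim.x && b.y <= kMaxBlockDim.y && b.z <= kMaxBlockDim.z &&
           b.x * b.y * b.z <= kMaxThreadsPerBlock;
}

// A grid only has per-axis limits in this output; there is no cap on the product.
constexpr bool grid_fits(Dim3 g) {
    return g.x >= 1 && g.y >= 1 && g.z >= 1 &&
           g.x <= kMaxGridDim.x && g.y <= kMaxGridDim.y && g.z <= kMaxGridDim.z;
}
```

For example, `block_fits(Dim3{16, 16, 2})` holds (512 threads in total), while `block_fits(Dim3{512, 512, 64})` does not.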

These are the maximum values, and I have read that any configuration larger than the available hardware will be pipelined. What I would like to know is: what are the dimensions of my board without any pipelining? I want to max out the board's parallelism without pipelining the execution, to get theoretical maximum performance values.

Where can I read more about the architecture? For instance, it says on the website that my board has 12 multiprocessors. How many processors are inside each multiprocessor and how many hardware threads can each processor run?

kuisma · February 2, 2008, 8:49pm

“CUDA Programming Guide” is an excellent start.

You can find it at http://www.nvidia.com/object/cuda_develop.html

Kuisma

philprattszeliga · February 4, 2008, 5:59pm

Okay, yes, that is a good reference. I have read it before, but not carefully enough. I have a lot to learn. :)

It says:
Each multiprocessor is composed of 8 processors so that a multiprocessor can process 32 threads of a warp in 4 clock cycles.

I know this is probably naive, but could I do a configuration with 8 threads per multiprocessor to process 8 threads in 1 clock cycle??

kuisma · February 4, 2008, 6:12pm

No.

kuisma · February 4, 2008, 6:16pm

Have a look at appendix A - it contains all the figures you’ll need.

Mark_Harris · February 11, 2008, 2:29pm

Technically yes, you can run 8 threads per multiprocessor. Launch as many thread blocks as there are multiprocessors, 8 threads per block.

But the real answer to your question, as posted before, is no, because threads are run 32 at a time (one warp) and each warp takes 4 cycles per instruction. So your 8 threads would take 4 cycles per instruction too.
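
The arithmetic behind this can be sketched in a few lines of C++. This is a simplified model that assumes only the two figures quoted from the Programming Guide earlier in the thread (32 threads per warp, 8 processors per multiprocessor). It ignores memory latency and scheduling, and the function names are hypothetical.

```cpp
// Figures quoted from the CUDA Programming Guide earlier in this thread.
constexpr unsigned kWarpSize = 32;                   // threads per warp
constexpr unsigned kProcessorsPerMultiprocessor = 8; // scalar processors per multiprocessor

// One warp instruction is spread over 8 processors: 32 / 8 = 4 cycles.
constexpr unsigned kCyclesPerWarpInstruction =
    kWarpSize / kProcessorsPerMultiprocessor;

// Threads are issued in whole warps, so a partial warp still counts as one warp.
constexpr unsigned warps_in_block(unsigned threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Issue cycles for one instruction across every warp of a block
// (simplified: no latency, no overlap with other blocks).
constexpr unsigned issue_cycles_per_instruction(unsigned threads_per_block) {
    return warps_in_block(threads_per_block) * kCyclesPerWarpInstruction;
}
```

Under this model, `issue_cycles_per_instruction(8)` and `issue_cycles_per_instruction(32)` both come out to 4: an 8-thread block occupies a full warp slot and leaves 24 lanes idle rather than finishing in 1 cycle.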

In general you should try to run at least a few warps per thread block (>96 threads) to help hide latency.

Mark

Sarnath · February 12, 2008, 11:35am

Philp…,

Yeah, I understand your hatred towards pipelining. But in practice, you usually need more active threads per multiprocessor than there are processors, so that latencies can be hidden.

To hide register latencies (such as register read-after-write hazards, pipeline hazards and so on) and to hide memory-access latencies (global memory, for example, is slow), you need at least 192 threads per multiprocessor.

These 192 threads could be 1 block of 192 threads per block, or 6 blocks of 32 threads per block, and so on. The configuration is up to your application.
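
As a rough illustration of that trade-off, here is a minimal C++ sketch that counts active threads per multiprocessor and the number of resident blocks needed to reach the 192-thread figure from this post. The names are hypothetical, and the sketch assumes `threads_per_block` is at least 1. Whether that many blocks can actually be resident also depends on register and shared-memory usage, which is not modelled here.

```cpp
// Target taken from the post above: at least 192 active threads per multiprocessor.
constexpr unsigned kTargetThreadsPerMultiprocessor = 192;

// Active threads on one multiprocessor for a given number of resident blocks.
constexpr unsigned active_threads(unsigned blocks_per_multiprocessor,
                                  unsigned threads_per_block) {
    return blocks_per_multiprocessor * threads_per_block;
}

// True if the configuration reaches the 192-thread target.
constexpr bool meets_target(unsigned blocks_per_multiprocessor,
                            unsigned threads_per_block) {
    return active_threads(blocks_per_multiprocessor, threads_per_block) >=
           kTargetThreadsPerMultiprocessor;
}

// Smallest number of resident blocks that reaches the target (ceiling division).
// Assumes threads_per_block >= 1.
constexpr unsigned blocks_needed(unsigned threads_per_block) {
    return (kTargetThreadsPerMultiprocessor + threads_per_block - 1) /
           threads_per_block;
}
```

Both configurations named above satisfy `meets_target`: `meets_target(1, 192)` and `meets_target(6, 32)`. Likewise, `blocks_needed(64)` is 3 and `blocks_needed(128)` is 2.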
