So I’m trying to figure out if I have the threading model correct. The statements below are how I currently understand CUDA. If any statement is wrong, could you please tell me where I’ve made a mistake? Thanks.
Each card has several multiprocessors. Each multiprocessor has 8 processors. Each processor can execute 768 threads at once. The 8 processors have shared memory they can access.
So, for a geforce 8600 with 4 multiprocessors, I can have a maximum of
4 * 8 * 768 = 24576 threads executing nearly concurrently over 32 blocks, and the remaining threads will be scheduled to be processed after these threads complete.
OK, where exactly do the grids the SDK docs talk about come into play and are the rest of my statements correct? Is a grid just a bunch of blocks that the kernel implements? Thanks for the help.
You are close: Each card has several multiprocessors (correct). Each multiprocessor has 8 ALUs (it is easier to think of these as ALUs instead of processors because all 8 ALUs share the same instruction decoder and other resources). Each multiprocessor can handle up to 768 threads concurrently.
So the total number of concurrent threads is num_multiprocessors * 768, or 12288 for an 8800 GTX (16 multiprocessors). Scheduling is handled in hardware and is essentially free, so there is no penalty for launching more threads than this.
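For illustration, that arithmetic can be checked at runtime by querying the device with `cudaGetDeviceProperties` (a sketch, with the 768-threads-per-multiprocessor figure hard-coded here because it is the limit for the compute 1.0/1.1 parts being discussed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // 768 resident threads per multiprocessor on compute 1.0/1.1 hardware
    // (e.g. 8800 GTX); this value is assumed, not queried.
    const int threadsPerMP = 768;
    int concurrent = prop.multiProcessorCount * threadsPerMP;

    printf("%d multiprocessors -> up to %d concurrent threads\n",
           prop.multiProcessorCount, concurrent);
    return 0;
}
```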
Yep. You specify the size of the grid and that many blocks will be launched by the GPU hardware.
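To make the grid/block relationship concrete, here is a minimal sketch (the kernel and sizes are made up for illustration): a grid is just the set of blocks that one kernel launch creates, and each block contributes `blockDim.x` threads.

```cuda
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its global index.
__global__ void fill(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const int threadsPerBlock = 256;   // well under the 512-per-block limit
    const int blocks = 32;             // the "grid" is these 32 blocks
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));

    // <<<grid, block>>> — the hardware schedules all 32 blocks for you,
    // running as many concurrently as the multiprocessors can hold.
    fill<<<blocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```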
Thanks for the response. What exactly does ALU stand for? I’ve done some more testing, and it looks like 512 is the actual max number of threads for me. Even with a blank kernel, if I launch 513 threads the kernel doesn’t execute at all; anything 512 or below works like a charm. I’m using the beta on Vista. Is this possibly a platform limitation? Thanks
Arithmetic Logic Unit. It’s the generic term for the digital logic that performs mathematical operations like addition, subtraction, and multiplication, as well as bitwise operations like AND, OR, XOR, shifts, and rotates.
There is a hard limit of 512 threads per block (that is what you are hitting, not a platform issue), but a multiprocessor can multiplex several blocks at a time, assuming there are sufficient registers and shared memory to do so. To hit the 768-thread limit, you have to run a kernel with, say, 256 threads per block, and launch at least 3x as many blocks as there are multiprocessors.