So I’m trying to figure out if I have the threading model correct. The statements below are how I currently understand CUDA. If any statement is wrong, could you please tell me where I’ve made a mistake? Thanks.
Each card has several multiprocessors. Each multiprocessor has 8 processors. Each processor can execute 768 threads at once. The 8 processors have shared memory they can access.
So, for a geforce 8600 with 4 multiprocessors, I can have a maximum of
4 * 8 * 768 = 24576 threads executing nearly concurrently over 32 blocks, and the remaining threads will be scheduled to be processed after these threads complete.
OK, where exactly do the grids the SDK docs talk about come into play and are the rest of my statements correct? Is a grid just a bunch of blocks that the kernel implements? Thanks for the help.
You are close: Each card has several multiprocessors (correct). Each multiprocessor has 8 ALUs (it is easier to think of these as ALUs instead of processors because all 8 ALUs share the same instruction decoder and other resources). Each multiprocessor can handle up to 768 threads concurrently.
So the total number of concurrent threads is num_multiprocessors * 768, or 12288 for an 8800 GTX (16 multiprocessors). Scheduling is handled in hardware and is essentially free, so there is no penalty for launching more threads than this.
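For illustration, that arithmetic can be checked at runtime by querying the device with `cudaGetDeviceProperties` (a sketch, with the 768-threads-per-multiprocessor figure hard-coded here because it is the limit for the compute 1.0/1.1 parts being discussed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // 768 resident threads per multiprocessor on compute 1.0/1.1 hardware
    // (e.g. 8800 GTX); this value is assumed, not queried.
    const int threadsPerMP = 768;
    int concurrent = prop.multiProcessorCount * threadsPerMP;

    printf("%d multiprocessors -> up to %d concurrent threads\n",
           prop.multiProcessorCount, concurrent);
    return 0;
}
```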
Yep. You specify the size of the grid and that many blocks will be launched by the GPU hardware.
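To make the grid/block relationship concrete, here is a minimal sketch (the kernel and sizes are made up for illustration): a grid is just the set of blocks that one kernel launch creates, and each block contributes `blockDim.x` threads.

```cuda
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its global index.
__global__ void fill(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const int threadsPerBlock = 256;   // well under the 512-per-block limit
    const int blocks = 32;             // the "grid" is these 32 blocks
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));

    // <<<grid, block>>> — the hardware schedules all 32 blocks for you,
    // running as many concurrently as the multiprocessors can hold.
    fill<<<blocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```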
Thanks for the response. What exactly does ALU stand for? I’ve done some more testing, and it looks like 512 is the actual max number of threads for me. Even with a blank kernel, if I launch 513 threads the kernel doesn’t execute at all; anything 512 or below works like a charm. I’m using the beta on Vista. Is this possibly a platform limitation? Thanks
Arithmetic Logic Unit. It’s the generic term for the digital logic that performs mathematical operations like addition, subtraction, and multiplication, as well as bitwise operations like AND, OR, XOR, shifts, and rotates.
There is a hard limit of 512 threads per block (that is what you are hitting, not a platform issue), but a multiprocessor can multiplex several blocks at a time, assuming there are sufficient registers and shared memory to do so. To hit the 768-thread limit, you have to run a kernel with, say, 256 threads per block, and launch at least 3x as many blocks as there are multiprocessors.