Help me understand CUDA

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.

Could anybody please clarify this: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.

I was watching Stanford's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?

Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of being able to run 768 threads concurrently when a block maxes out at 512 threads?

A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads each), and that SPs execute warps.

A block of threads will be executed on a single multiprocessor, and multiple blocks can be assigned to each multiprocessor. Do you agree with that?

Also, you should not think of the streaming processors as being able to act independently. SPs behave more like ALUs than processor cores. There is one instruction decoder per streaming multiprocessor, which decodes the instruction for a warp. The instruction propagates down a fairly long pipeline for each SP, which is one reason why a warp has width 32 when there are only 8 SPs per SM. Wide warps make it easier to keep the pipeline full of independent operations.

From the software side, CUDA allows you (the programmer) to focus on the behavior of a thread individually, without vector instructions, when writing a kernel. However, in the underlying hardware, you should think of the SM as analogous to a CPU core, and the SPs correspond to ALUs which can complete a 32-wide vector instruction every 4 clock cycles. On top of that, there is no overhead for the SM to switch between warps (which can come from different blocks), so having the SM run as many threads as possible (768) gives it the best chance to hide global memory latency.
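To make that concrete, here is a minimal sketch (the kernel name `vecAdd`, the sizes, and the launch configuration are just illustrative, and the buffers are left uninitialized since the point is the launch shape, not the values): each thread is written as plain scalar code that handles one element, and the hardware groups those threads into 32-wide warps that the SM schedules across its SPs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one element; the code never mentions warps or
// vector lanes, but the SM executes it 32 threads (one warp) at a time.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                       // guard threads past the end
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    int threadsPerBlock = 256;                       // a multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n); // the SM interleaves warps from several blocks
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```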

(If your 8800 GTX were an Intel CPU, then they would say it has 16 cores, each core processing 32-wide SSE instructions and supporting 24-way hyperthreading.)

There are many flavors of GeForce 8800 with different SM counts and even different capabilities. The original GeForce 8800 GTS with the G80 GPU had 96 SPs (12 SMs, 8 SPs per SM), and I think this is the one they were talking about in the presentation, hence the 96 parallel threads.

First note that:

  1. The maximum number of active blocks per multiprocessor is 8

  2. The maximum number of active warps per multiprocessor is 24

  3. The maximum number of active threads per multiprocessor is 768

Let me clear this up using examples:

Example 1: Suppose we choose a thread block size of 16 x 16 = 256 threads. Per condition 3, we can have 768/256 = 3 blocks per multiprocessor. This is legal under condition 1, since 3 < 8 (and under condition 2 as well, since 3 blocks x 8 warps per block = 24 warps), so we have 3 blocks running on a multiprocessor. That amounts to a total of 3 blocks x 256 threads per block = 768 active threads.

Example 2: If we choose a thread block size of 4 x 4 = 16 threads, then per condition 3 we could fit 768/16 = 48 blocks per multiprocessor. This violates condition 1, since 48 > 8, so we only get 8 blocks per multiprocessor. That amounts to a total of 8 blocks x 16 threads per block = 128 active threads, so in this case occupancy drops by a factor of 6 (768/128 = 6), leaving far fewer threads to hide memory latency.
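The same arithmetic can be written as a small host-side helper. This is only a sketch of the two examples above, hard-coding the G80-era limits quoted earlier (the function name `activeThreadsPerSM` is made up); real code would query the device or use the CUDA occupancy calculator instead of fixed constants.

```cuda
#include <cstdio>

// Sketch: given a block size, apply the three per-SM limits quoted above
// (8 blocks, 24 warps, 768 threads) and return the resulting active thread count.
int activeThreadsPerSM(int threadsPerBlock)
{
    const int maxBlocksPerSM  = 8;
    const int maxWarpsPerSM   = 24;
    const int maxThreadsPerSM = 768;
    const int warpSize        = 32;

    int warpsPerBlock  = (threadsPerBlock + warpSize - 1) / warpSize;
    int blocksByThread = maxThreadsPerSM / threadsPerBlock;   // condition 3
    int blocksByWarp   = maxWarpsPerSM / warpsPerBlock;       // condition 2
    int blocks = blocksByThread;
    if (blocksByWarp < blocks)   blocks = blocksByWarp;
    if (maxBlocksPerSM < blocks) blocks = maxBlocksPerSM;     // condition 1
    return blocks * threadsPerBlock;
}

int main()
{
    printf("256 threads/block -> %d active threads per SM\n", activeThreadsPerSM(256)); // 768
    printf(" 16 threads/block -> %d active threads per SM\n", activeThreadsPerSM(16));  // 128
    return 0;
}
```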

Choosing the right block size is very important :-)

Also, it is advisable to choose a block size that is a multiple of 64.

One thing about warps: threads always execute as warps of 32 threads. So if your block size is 65, then three warps will be running: warp 0 (threadIdx.x 0 to 31), warp 1 (threadIdx.x 32 to 63), and warp 2 (threadIdx.x 64 only; the remaining 31 lanes, which would correspond to threads 65 to 95, are dummy/inactive). Thread IDs are contiguous within a block, spanning the different warps.
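To see that mapping in code, here is a small sketch (the kernel name `whichWarp` is made up): launch one block of 65 threads and have each thread record which warp and lane it falls into, which reproduces the warp 0 / warp 1 / warp 2 split described above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread records its warp index and lane index within the block.
__global__ void whichWarp(int *warpId, int *lane)
{
    int t = threadIdx.x;
    warpId[t] = t / 32;   // warp index within the block
    lane[t]   = t % 32;   // lane index within the warp
}

int main()
{
    const int n = 65;                     // one block of 65 threads = 3 warps
    int *dWarp, *dLane;
    cudaMalloc(&dWarp, n * sizeof(int));
    cudaMalloc(&dLane, n * sizeof(int));

    whichWarp<<<1, n>>>(dWarp, dLane);

    int hWarp[n], hLane[n];
    cudaMemcpy(hWarp, dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLane, dLane, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int t = 0; t < n; ++t)
        printf("threadIdx.x %2d -> warp %d, lane %2d\n", t, hWarp[t], hLane[t]);

    cudaFree(dWarp);
    cudaFree(dLane);
    return 0;
}
```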