Help me understand CUDA

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.

Could anybody please clarify this: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.

I was watching Stanford's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?

Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of being able to run 768 threads concurrently when a block maxes out at 512 threads?

A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads each), and that SPs execute warps.

A block of threads will be executed on a single multiprocessor, and multiple blocks can be assigned to each multiprocessor. Do you agree with that?

Also, you should not think of the streaming processors as being able to act independently. SPs behave more like ALUs than processor cores. There is one instruction decoder per streaming multiprocessor, which decodes the instruction for a warp. The instruction propagates down a fairly long pipeline for each SP, which is one reason why a warp has width 32 when there are only 8 SPs per SM. Wide warps make it easier to keep the pipeline full of independent operations.

From the software side, CUDA allows you (the programmer) to focus on the behavior of a thread individually, without vector instructions, when writing a kernel. However, in the underlying hardware, you should think of the SM as analogous to a CPU core, and the SPs correspond to ALUs which can complete a 32-wide vector instruction every 4 clock cycles. On top of that, there is no overhead for the SM to switch between warps (which can come from different blocks), so having the SM run as many threads as possible (768) gives it the best chance to hide global memory latency.
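To make that concrete, here is a minimal sketch (the kernel name `vecAdd`, the sizes, and the launch configuration are just illustrative, and the buffers are left uninitialized since the point is the launch shape, not the values): each thread is written as plain scalar code that handles one element, and the hardware groups those threads into 32-wide warps that the SM schedules across its SPs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one element; the code never mentions warps or
// vector lanes, but the SM executes it 32 threads (one warp) at a time.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                       // guard threads past the end
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    int threadsPerBlock = 256;                       // a multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n); // the SM interleaves warps from several blocks
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```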

(If your 8800 GTX were an Intel CPU, then they would say it has 16 cores, each core processing 32-wide SSE instructions and supporting 24-way hyperthreading.)

There are many flavors of GeForce 8800 with different SM counts and even different capabilities. The original GeForce 8800 GTS with the G80 GPU had 96 SPs (12 SMs, 8 SPs per SM), and I think this is the one they were talking about in the presentation, hence the 96 parallel threads.

First note that:

  1. The maximum number of active blocks per multiprocessor is 8

  2. The maximum number of active warps per multiprocessor is 24

  3. The maximum number of active threads per multiprocessor is 768

Let me clear this up using examples:

Example 1: Suppose we choose a thread block size of 16 x 16 = 256 threads. Per condition 3, we can have 768/256 = 3 blocks per multiprocessor. This is legal under condition 1, since 3 < 8 (and under condition 2 as well, since 3 blocks x 8 warps per block = 24 warps), so we have 3 blocks running on a multiprocessor. That amounts to a total of 3 blocks x 256 threads per block = 768 active threads.

Example 2: If we choose a thread block size of 4 x 4 = 16 threads, then per condition 3 we could fit 768/16 = 48 blocks per multiprocessor. This violates condition 1, since 48 > 8, so we only get 8 blocks per multiprocessor. That amounts to a total of 8 blocks x 16 threads per block = 128 active threads, so in this case occupancy drops by a factor of 6 (768/128 = 6), leaving far fewer threads to hide memory latency.
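The same arithmetic can be written as a small host-side helper. This is only a sketch of the two examples above, hard-coding the G80-era limits quoted earlier (the function name `activeThreadsPerSM` is made up); real code would query the device or use the CUDA occupancy calculator instead of fixed constants.

```cuda
#include <cstdio>

// Sketch: given a block size, apply the three per-SM limits quoted above
// (8 blocks, 24 warps, 768 threads) and return the resulting active thread count.
int activeThreadsPerSM(int threadsPerBlock)
{
    const int maxBlocksPerSM  = 8;
    const int maxWarpsPerSM   = 24;
    const int maxThreadsPerSM = 768;
    const int warpSize        = 32;

    int warpsPerBlock  = (threadsPerBlock + warpSize - 1) / warpSize;
    int blocksByThread = maxThreadsPerSM / threadsPerBlock;   // condition 3
    int blocksByWarp   = maxWarpsPerSM / warpsPerBlock;       // condition 2
    int blocks = blocksByThread;
    if (blocksByWarp < blocks)   blocks = blocksByWarp;
    if (maxBlocksPerSM < blocks) blocks = maxBlocksPerSM;     // condition 1
    return blocks * threadsPerBlock;
}

int main()
{
    printf("256 threads/block -> %d active threads per SM\n", activeThreadsPerSM(256)); // 768
    printf(" 16 threads/block -> %d active threads per SM\n", activeThreadsPerSM(16));  // 128
    return 0;
}
```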

Choosing the right block size is very important :-)

Also, it is advisable to choose a block size that is a multiple of 64.

One thing about warps: threads always execute as warps of 32 threads. So if your block size is 65, then three warps will be running: warp 0 (threadIdx.x 0 to 31), warp 1 (threadIdx.x 32 to 63), and warp 2 (threadIdx.x 64 only; the remaining 31 lanes, which would correspond to threads 65 to 95, are dummy/inactive). Thread IDs are contiguous within a block, spanning the different warps.
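To see that mapping in code, here is a small sketch (the kernel name `whichWarp` is made up): launch one block of 65 threads and have each thread record which warp and lane it falls into, which reproduces the warp 0 / warp 1 / warp 2 split described above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread records its warp index and lane index within the block.
__global__ void whichWarp(int *warpId, int *lane)
{
    int t = threadIdx.x;
    warpId[t] = t / 32;   // warp index within the block
    lane[t]   = t % 32;   // lane index within the warp
}

int main()
{
    const int n = 65;                     // one block of 65 threads = 3 warps
    int *dWarp, *dLane;
    cudaMalloc(&dWarp, n * sizeof(int));
    cudaMalloc(&dLane, n * sizeof(int));

    whichWarp<<<1, n>>>(dWarp, dLane);

    int hWarp[n], hLane[n];
    cudaMemcpy(hWarp, dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLane, dLane, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int t = 0; t < n; ++t)
        printf("threadIdx.x %2d -> warp %d, lane %2d\n", t, hWarp[t], hLane[t]);

    cudaFree(dWarp);
    cudaFree(dLane);
    return 0;
}
```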