768 threads vs warp

cudacuda2009 · August 16, 2009, 10:55am

Hi,

I am a little confused about the following points:

1- A Streaming Multiprocessor has 8 scalar processors, so how can a warp of 32 threads run simultaneously? Doesn’t that mean 4 threads run simultaneously on a single scalar processor?

2- What are the 768 threads, then? It is mentioned in several places that each SM can accommodate up to 768 threads. But then what is a warp?

3- If I ask you for the maximum number of threads that can run simultaneously on a single SM, what would your answer be: 32 (one warp) or 768?

cbuchner1 · August 16, 2009, 11:02am

I think the number 768 is the scheduling queue length of a single multiprocessor (compute capability 1.1), which can accommodate up to 3 thread blocks of 256 threads simultaneously. The scheduling is done in terms of warps, so there are 24 scheduling “slots” available in the hardware (24*32 = 768). I believe compute 1.2 and later devices can handle more than 3 thread blocks simultaneously. I have no idea if they also extended the length of the scheduling queue.
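The slot arithmetic above can be written out as a small sketch. The function names are made up for illustration, and the block count only considers the thread limit; it ignores registers, shared memory, and any separate cap on blocks per SM.

```python
WARP_SIZE = 32  # threads per warp, as stated in this thread


def resident_warps(max_threads_per_sm, warp_size=WARP_SIZE):
    """Number of warp 'slots' implied by a per-SM thread limit (hypothetical helper)."""
    return max_threads_per_sm // warp_size


def resident_blocks(max_threads_per_sm, threads_per_block):
    """Blocks that fit by thread count alone; other resource limits are ignored."""
    return max_threads_per_sm // threads_per_block


print(resident_warps(768))        # warp slots for a 768-thread SM
print(resident_blocks(768, 256))  # 256-thread blocks that fit by thread count
```

With the numbers from the post, this gives 24 warp slots and 3 blocks of 256 threads.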

The 32 threads of a warp execute in a quasi-parallel fashion on 8 “CUDA processors” (ALUs), which takes 4 clock cycles or more (depending on the type of instruction that is being executed).
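As a rough sketch of where the 4 cycles come from: 32 threads divided over 8 ALUs means 4 passes. The helper names below are hypothetical, and the assumption that each pass handles 8 consecutive thread indices is only for illustration; the actual lane-to-cycle mapping in hardware is not stated in this thread.

```python
def cycles_per_warp_instruction(warp_size=32, scalar_processors=8):
    """Minimum cycles to push one instruction through a whole warp (ceiling division)."""
    return -(-warp_size // scalar_processors)


def threads_in_pass(pass_index, scalar_processors=8):
    """Thread indices handled in a given pass, ASSUMING consecutive groups of 8."""
    start = pass_index * scalar_processors
    return list(range(start, start + scalar_processors))


print(cycles_per_warp_instruction())
for p in range(cycles_per_warp_instruction()):
    print(p, threads_in_pass(p))
```

This only models the simplest case; as noted above, some instruction types take more than 4 cycles.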

Feel free to correct me if I’ve misstated some fact.

_Big_Mac · August 16, 2009, 11:16am

1- Basically, it takes 4 cycles for the SM to issue an instruction to its 8 SPs. So, to keep the hardware saturated, each SP applies this instruction to 4 threads in sequence. They don’t physically run simultaneously on the SPs but in a sort of pipeline instead.
2- 768 threads is 24 warps. Up to 24 (32 in newer hardware) warps can be issued to an SM at any given moment (given enough resources). Those warps can be from different blocks.
3- That depends on your definition of “simultaneous”, as weird as that may sound.
An SM is free to switch to any of the queued warps at any time and does that to hide latencies. From this perspective, there are 768 threads per SM “in-flight”, with allocated registers, ready to be processed. You could say they are being worked on simultaneously.
During any given 4 cycles, the SM issues an instruction to a single warp. From this perspective, 32 threads are being worked on simultaneously by the 8 SPs in a pipeline.
Each cycle, 8 results of a single instruction “pop out” of the 8 SPs. From this perspective, 8 threads run simultaneously.

The answer I’d give is 32, but all of the 768 threads are indeed used. The effective throughput increases thanks to having those 24 warps scheduled, compared with having only the one that’s executing.
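To make the throughput point concrete, here is a deliberately simplified toy model, not a description of the real scheduler. It assumes every warp issues one instruction (4 cycles) and then stalls for a fixed number of cycles, and that the SM can switch to any other ready warp at no cost. The function name and the 92-cycle stall are illustrative values picked so the numbers come out round.

```python
def issue_utilization(resident_warps, stall_cycles, issue_cycles=4):
    """Fraction of cycles the SPs are busy in the toy model described above."""
    busy = resident_warps * issue_cycles   # work available per stall period
    period = issue_cycles + stall_cycles   # time before a warp can issue again
    return min(1.0, busy / period)


# Illustrative stall of 92 cycles (an assumed value, not a measured latency).
print(issue_utilization(1, 92))    # a single warp leaves the SPs mostly idle
print(issue_utilization(24, 92))   # 24 warps cover the stall in this model
```

In this model, one warp keeps the SPs busy for 4 out of every 96 cycles, while 24 resident warps keep them fully busy, which is the sense in which all 768 threads are "used".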
