How do CUDA cores on an SM execute warps concurrently?

As I understand it, this is the main control flow of a CUDA program:

Kernel --> thread block(s) --> each block is executed by one SM at a time --> the thread block is divided into warps (32 threads per warp) --> all warps are handled concurrently (does this mean in parallel?)

So now assume we are using the Fermi architecture, which supports 1536 resident threads per SM.

1536 / 32 = 48 warps per SM
Half of the 48 warps are dispatched to one of the 2 warp schedulers based on their warp ID (odd or even)
Now we have 24 warps and 16 CUDA cores per warp scheduler

They say that all warps run concurrently, so my question is: how can 16 CUDA cores execute 24 warps concurrently? And is "concurrently" different from "in parallel" or not?

Warps in a block are executed cooperatively. When a warp stalls on a memory access, the scheduler picks another warp to execute. This is somewhat like how hyperthreading works on Intel processors.

So one CUDA core is occupied by a thread (of one warp), and the other threads are waiting to get onto a CUDA core for execution --> is that what "concurrently" means?
And I have just read that the 16 CUDA cores of each warp scheduler together execute a warp in 2 cycles (the 16 cores execute 16 of the warp's 32 threads per cycle). After a warp finishes its instruction, another warp replaces it on those 16 cores. So the scheduler makes a tour of its 24 warps, executing one instruction from each, and then starts over until all instructions are finished?

First, all threads of a warp are executed by an SM in lockstep (on pre-Fermi hardware, for example, the 8 CUDA cores of one SM execute a warp's 32 threads in 4 cycles). Then, if the last instruction executed introduces a stall (a memory access, __syncthreads(), or something else), another warp is picked up. If there is no stall, I believe the SM keeps executing the same warp. Thus one warp can actually run ahead of the other warps. Use __syncthreads() to synchronize them if you want to make sure all warps are at the same place in your code.

I'd expect the warp scheduler to follow a fairer policy than that, or performance would suffer quite badly. All of this is undocumented, though.

I think the depth of the pipeline means the scheduler has to spread execution across many warps at the instruction level to avoid pipeline bubbles and large branch penalties.

Remember: alternating between normal threads on a CPU requires a context switch into the OS and many cycles to store the state of the CPU (registers, etc.) to memory somewhere. With CUDA, all the registers, shared memory and local memory for the entire block of threads are reserved at the start of the block. "Switching" between warps on every instruction requires no expensive memory operations, but does need some hardware in the SM to manage the list of active and stalled warps. This is why defining "concurrency" in the context of an SM is not straightforward. All active warps are being executed concurrently, but each warp has an independent instruction pointer.

Can you point me to any document about how the warp scheduler and the instruction dispatch unit work? I have a project on GPU architecture (specifically the Fermi architecture), and I need to understand this clearly to do the project well. I have found many documents, but most of them are not clear, and I have had to guess at how these units work!

I was interested in learning more about the warp scheduler tonight and came across the following document. It satisfied most of my curiosity; hopefully it’ll be of help to you too.

http://www.microarch.org/micro40/talks/7-3.ppt


This is interesting, but appears to be about some possible improvements to a future GPU.