How do CUDA cores on a SM execute warps concurrently?

As far as I know, this is the main control flow of a CUDA program:

Kernel --> thread block(s) --> each block is executed by one SM at a time --> each thread block is divided into warps (32 threads per warp) --> all warps are handled concurrently (does this mean in parallel?)

So now assume that we are using the Fermi architecture, which supports up to 1536 resident threads per SM.

1536/32 = 48 warps per SM
Half of the 48 warps are dispatched to each of the 2 warp schedulers based on their ID (odd or even)
Now we have 24 warps per 16 CUDA cores for each warp scheduler
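
The arithmetic above can be written out as a small sketch. The constants are the Fermi figures quoted in this thread (1536 resident threads per SM, 2 warp schedulers, 16 CUDA cores per scheduler); the variable names are made up for illustration and are not part of any CUDA API.

```python
# Figures assumed from the discussion (Fermi); names are illustrative only.
WARP_SIZE = 32
MAX_RESIDENT_THREADS_PER_SM = 1536
WARP_SCHEDULERS_PER_SM = 2
CORES_PER_SCHEDULER = 16

# Resident warps on one SM when it is fully occupied.
warps_per_sm = MAX_RESIDENT_THREADS_PER_SM // WARP_SIZE

# Warps are split evenly between the two schedulers (odd/even warp ID).
warps_per_scheduler = warps_per_sm // WARP_SCHEDULERS_PER_SM

print(warps_per_sm, "warps per SM")
print(warps_per_scheduler, "warps share", CORES_PER_SCHEDULER, "cores per scheduler")
```

So each scheduler has more resident warps than it can issue at once, which is exactly what the question is about.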

They say that all warps run concurrently, so my question is: how can 16 CUDA cores execute 24 warps concurrently? Is "concurrently" different from "in parallel" or not?

Warps in a block are executed cooperatively. When a warp stalls on a memory access, the scheduler picks another warp to execute. It is sort of the same way hyperthreading works on Intel processors.

So, one CUDA core is occupied by one thread (of one warp), and the other threads are waiting to get onto a CUDA core for execution --> is that what "concurrently" means?
And I have just read that the 16 CUDA cores of each warp scheduler together execute one warp instruction in 2 cycles (16 cores execute 16 threads of the warp per cycle). After this warp finishes its instruction, one of the other warps replaces it and gets the 16 cores for execution. So does the scheduler make a tour of the 24 warps, executing 1 instruction per warp, and then start again until all their instructions are finished?
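
To put numbers on that picture, here is a minimal sketch of the cycle counts. It assumes a strict one-instruction-per-warp tour, which is only the guess made in this post and not documented hardware behaviour; the function names are hypothetical.

```python
def cycles_per_warp_instruction(warp_size: int, cores: int) -> int:
    """Cycles needed to push every thread of one warp through `cores` lanes."""
    return -(-warp_size // cores)  # ceiling division


def cycles_per_tour(resident_warps: int, warp_size: int, cores: int) -> int:
    """Cycles for one instruction from each warp, assuming a strict tour."""
    return resident_warps * cycles_per_warp_instruction(warp_size, cores)


# 32 threads on 16 cores: two passes per warp instruction.
print(cycles_per_warp_instruction(32, 16))
# 24 warps, one instruction each, on one scheduler's 16 cores.
print(cycles_per_tour(24, 32, 16))
```

Under these assumptions a warp would get its next instruction issued only once per tour, which leaves time for a long-latency operation to complete in between.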

First, all threads of a warp are executed by an SM in lockstep (on pre-Fermi hardware, for example, the 8 CUDA cores of one SM execute the 32 threads of a warp in 4 cycles). Then, if the last instruction executed introduces a stall (a memory access, __syncthreads, or something else), another warp is picked. If there is no stall, the SM keeps executing the same warp, I believe. Thus one warp can actually run ahead of the other warps. Use __syncthreads to synchronize them if you want to make sure all warps of a block are at the same place in your code.

I believe the warp scheduler follows a fairer policy than that, or performance would take quite a hit. All of this is undocumented, though.

I think the depth of the pipeline indicates that the scheduler has to spread execution across many warps at the instruction level to avoid pipeline bubbles and large branch penalties.

Remember: Alternating between normal threads on a CPU requires a context switch to the OS and many cycles to store the state of the CPU (registers, etc) to memory somewhere. With CUDA, all the registers, shared memory and local memory for the entire block of threads are reserved at the start of the block. “Switching” between warps on every instruction requires no expensive memory operations, but does need some hardware in the SM to manage the list of active and stalled warps. This is why defining “concurrency” in the context of a SM is not straightforward. All active warps are getting executed concurrently, but each warp has an independent instruction pointer.
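
The effect can be illustrated with a toy model. This is only a sketch under loud assumptions: one scheduler, round-robin selection among ready warps, a fixed memory latency, and 2 cycles to issue each warp instruction. The real scheduling policy is undocumented, and `simulate` and all of its parameters are made up for illustration.

```python
def simulate(num_warps, instrs_per_warp, mem_every, mem_latency, issue_cycles=2):
    """Toy model of one warp scheduler (NOT the real hardware policy).

    Every `mem_every`-th instruction of a warp is a memory access; the warp
    may not issue again until `mem_latency` cycles after that access issued.
    Returns the cycle at which the last instruction has been issued.
    """
    pc = [0] * num_warps        # independent instruction pointer per warp
    ready_at = [0] * num_warps  # cycle at which each warp may issue again
    cycle = 0
    last = -1                   # warp that issued most recently
    while any(p < instrs_per_warp for p in pc):
        chosen = None
        # Assumed policy: round-robin, starting after the last issued warp.
        for k in range(1, num_warps + 1):
            w = (last + k) % num_warps
            if pc[w] < instrs_per_warp and ready_at[w] <= cycle:
                chosen = w
                break
        if chosen is None:
            cycle += 1          # every unfinished warp is stalled: a bubble
            continue
        is_mem = (pc[chosen] + 1) % mem_every == 0
        pc[chosen] += 1
        cycle += issue_cycles   # switching warps costs nothing extra
        ready_at[chosen] = cycle + (mem_latency if is_mem else 0)
        last = chosen
    return cycle


one_warp = simulate(num_warps=1, instrs_per_warp=8, mem_every=4, mem_latency=100)
many_warps = simulate(num_warps=24, instrs_per_warp=8, mem_every=4, mem_latency=100)
print("1 warp:  ", one_warp, "cycles")
print("24 warps:", many_warps, "cycles, versus", 24 * one_warp, "if run one after another")
```

In this model a single warp spends most of its time waiting on memory, while 24 resident warps fill most of those waiting cycles with each other's instructions. That is the sense in which the warps run "concurrently" even though only one warp instruction is issued at a time.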

Can you point me to any documents about the operation of the Warp Scheduler and the Instruction Dispatch Unit (how do they work?). I have a project about GPU architecture (specifically the Fermi architecture), and I need to understand it clearly to do the project well. I have found many documents, but most of them are not clear, and I have to guess how these units work!

I was interested in learning more about the warp scheduler tonight and came across the following document. It satisfied most of my curiosity; hopefully it’ll be of help to you too.

This is interesting, but appears to be about some possible improvements to a future GPU.