As far as I know, this is the main control flow of a CUDA program:
Kernel --> thread block(s) --> one block is executed by one SM at a time --> each thread block is divided into warps (32 threads per warp) --> all warps are handled concurrently (does this mean in parallel?)
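The grouping step above can be sketched in plain Python (this is not CUDA code, just an illustration of the mapping; the warp size of 32 is the only assumption, and it holds on Fermi):

```python
WARP_SIZE = 32

def warps_in_block(block_dim):
    """Group the linear thread IDs of one block into warps of WARP_SIZE.

    Thread i belongs to warp i // WARP_SIZE, which is how the
    hardware partitions a block's threads into warps.
    """
    threads = list(range(block_dim))
    return [threads[i:i + WARP_SIZE] for i in range(0, block_dim, WARP_SIZE)]

warps = warps_in_block(128)
print(len(warps))    # a 128-thread block yields 4 warps
print(warps[1][0])   # warp 1 starts at thread 32
```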
Now assume we are using the Fermi architecture, which supports up to 1536 resident threads per SM.
1536/32 = 48 warps per SM
Half of the 48 warps go to each of the SM's two warp schedulers, based on their warp ID (odd or even).
So each warp scheduler handles 24 warps but dispatches to only 16 CUDA cores.
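The arithmetic in the lines above can be checked with a short Python sketch (the 1536-threads-per-SM limit and the odd/even scheduler split are taken from the post itself):

```python
THREADS_PER_SM = 1536   # max resident threads per SM on Fermi
WARP_SIZE = 32
CORES_PER_SCHEDULER = 16

warps_per_sm = THREADS_PER_SM // WARP_SIZE   # 1536 / 32 = 48 warps
even_warps = [w for w in range(warps_per_sm) if w % 2 == 0]
odd_warps  = [w for w in range(warps_per_sm) if w % 2 == 1]

# Each of the two schedulers is responsible for 24 warps,
# yet each feeds only 16 CUDA cores.
print(warps_per_sm, len(even_warps), len(odd_warps))  # prints: 48 24 24
```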
The documentation says that all warps run concurrently, so my question is: how can 16 CUDA cores execute 24 warps concurrently? And is "concurrently" different from "in parallel"?