Okay, so I’m new at this too, but I’d like to try to help as much as I can.
The Jetson TK1 has a single Kepler SM (192 CUDA cores).
When you launch your kernel, the GPU maps blocks onto that SM, and the scheduling is done automatically. One thing I had wrong at first: a block does not have to finish before another block can be placed on the SM. An SM can hold several blocks resident at the same time, limited by its resources (registers, shared memory, and hardware caps on resident threads and blocks). Whenever a resident block finishes, the scheduler assigns the next waiting block to the SM.
Inside each block executing on the SM, the threads are split into warps (groups of 32 threads). When a warp stalls, for example while it waits on a memory load or at a synchronization point, the SM's warp schedulers simply switch to another warp that is ready and issue its next instruction. This "context switching" between warps is done in hardware at essentially no cost, which is how the GPU hides latency, and all of this scheduling happens for you.
Each thread is issued to a single CUDA core, and the instructions that thread executes are exactly the operations in your kernel's body. All 32 threads of a warp execute the SAME INSTRUCTION at the same time, each on its own data. NVIDIA calls this SIMT (single instruction, multiple threads), a close cousin of SIMD: the same instruction stream (the kernel) is applied to MULTIPLE DATA, where each thread picks out its own piece of the input using its indices (threadIdx, blockIdx) and runs on one CUDA core.
Does that help at all?