I am trying to understand the CUDA execution model, in particular how instructions and threads are scheduled. I am also trying to understand how computation is overlapped with data movement from global to shared memory. Within a warp of 32 threads, we have 4 sets of 8 threads. How is execution cycled through these 4 sets? Is there periodic time slicing in a round-robin manner, and if so, what is the granularity of the time slice? Also, is a single program counter maintained per warp? Across warps and thread blocks, how does execution progress, and what is the granularity of the time slice there?
Thanks in advance,