As far as I know, when a kernel is launched with many blocks, each block is mapped to an SM for execution. Since SMs execute independently and asynchronously, blocks also execute asynchronously. Given that, is it possible to overlap data transfer with compute by implementing the following kernel:
while (there are data chunks not yet consumed by the GPU)
{
    1. In each block, the thread with threadIdx 0 fetches a chunk of data
       from pinned host memory (using a memcpy inside the kernel)
    2. use __syncthreads() to sync all threads of the block
    3. all threads of the block then compute over the chunk
}
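To make the idea concrete, here is a minimal sketch of the pattern I have in mind. It assumes the host buffer was allocated as mapped pinned memory (cudaHostAlloc with cudaHostAllocMapped) and that the kernel receives the corresponding device pointer from cudaHostGetDevicePointer; the names CHUNK_SIZE, streamed_consume, and the doubling "compute" are just placeholders:

```cuda
#define CHUNK_SIZE 256  // illustrative chunk size, one element per thread

// Hypothetical kernel: host_data points at mapped pinned host memory,
// out is an ordinary device buffer.
__global__ void streamed_consume(const float *host_data, float *out,
                                 int num_chunks)
{
    __shared__ float chunk[CHUNK_SIZE];

    // Each block walks over its share of the chunks (grid-stride loop).
    for (int c = blockIdx.x; c < num_chunks; c += gridDim.x) {
        // 1. Thread 0 stages one chunk from pinned host memory
        //    into shared memory.
        if (threadIdx.x == 0) {
            for (int i = 0; i < CHUNK_SIZE; ++i)
                chunk[i] = host_data[c * CHUNK_SIZE + i];
        }

        // 2. Wait until the chunk is fully staged.
        __syncthreads();

        // 3. All threads of the block compute over the chunk
        //    (placeholder computation: double each element).
        if (threadIdx.x < CHUNK_SIZE)
            out[c * CHUNK_SIZE + threadIdx.x] = chunk[threadIdx.x] * 2.0f;

        // Ensure everyone is done reading before the next iteration
        // overwrites the shared buffer.
        __syncthreads();
    }
}
```

(I'm not sure whether a single thread serially copying the chunk is the right approach, or whether the access to pinned host memory is coalesced this way; it's just the most literal translation of the steps above.)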
I'm new to CUDA and GPU programming, so please correct me if I've got something wrong. :-)