Can I use blocks instead of streams to overlap data transfer with compute?

As far as I know, if a kernel is launched with many blocks, each block is assigned to an SM to run. Since SMs execute code independently and asynchronously, blocks also execute asynchronously with respect to each other. Is it therefore possible to overlap data transfer with compute by implementing the following kernel:

while (there are data chunks not consumed by the GPU)
1. for each block, the thread with threadIdx 0 fetches a chunk of data from pinned host memory (using memcpy inside the kernel)
2. use __syncthreads to sync all threads of the block
3. all threads of the block then do compute over the chunk
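
The steps above could be sketched roughly as follows. This is only a minimal illustration, assuming the pinned host buffer is allocated as mapped (zero-copy) memory on a platform with unified virtual addressing, so the kernel can read it directly. The names CHUNK, consume_chunks and run_zero_copy_demo are made up for this example, and a plain sum stands in for the real compute. Note the second __syncthreads, which keeps thread 0 from overwriting the chunk while other threads are still reading it.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Bail out of the demo on any CUDA error (a sketch: no cleanup on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

constexpr int CHUNK = 256;  // floats per chunk (illustrative size)

__global__ void consume_chunks(const float* mapped, int num_chunks, float* total)
{
    __shared__ float chunk[CHUNK];
    float local = 0.0f;

    // Each block walks over its own subset of chunks.
    for (int c = blockIdx.x; c < num_chunks; c += gridDim.x) {
        // Step 1: thread 0 copies one chunk from mapped host memory.
        if (threadIdx.x == 0)
            memcpy(chunk, mapped + (size_t)c * CHUNK, sizeof(chunk));
        // Step 2: wait until the chunk is in shared memory.
        __syncthreads();
        // Step 3: all threads compute over the chunk (here: a sum).
        for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
            local += chunk[i];
        // Do not let thread 0 overwrite the chunk while others still read it.
        __syncthreads();
    }
    atomicAdd(total, local);
}

bool run_zero_copy_demo()
{
    const int num_chunks = 64;
    const size_t n = (size_t)num_chunks * CHUNK;
    float *h_data = nullptr, *d_view = nullptr, *d_total = nullptr;

    // Pinned host memory that is also mapped into the device address space.
    CHECK(cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;
    CHECK(cudaHostGetDevicePointer((void**)&d_view, h_data, 0));

    CHECK(cudaMalloc((void**)&d_total, sizeof(float)));
    CHECK(cudaMemset(d_total, 0, sizeof(float)));

    consume_chunks<<<8, 128>>>(d_view, num_chunks, d_total);
    CHECK(cudaGetLastError());

    float total = 0.0f;
    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost));

    cudaFree(d_total);
    cudaFreeHost(h_data);
    return total == (float)n;  // every element is 1.0f, so the sum is n
}

int main()
{
    std::printf("zero-copy demo %s\n", run_zero_copy_demo() ? "passed" : "failed");
    return 0;
}
```

In this form, the threads of a block sit idle while thread 0 fetches, so any overlap would have to come from other resident blocks computing in the meantime.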

I’m new to CUDA and GPU programming. Please correct me if something is wrong. :-)

Yes, something like that should be workable. For performance reasons I would use at least a warp to fetch the data. For good overlap, the compute time per chunk must be roughly balanced against its transfer cost. I’m not sure a naive use of __syncthreads() is the right way to go for this producer-consumer model while still allowing overlap. The idea is correct, though: you need a sync mechanism of some sort.
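
One possible shape for the warp-based producer-consumer idea is sketched below. It is not a tuned implementation, just an illustration under some assumptions: the host buffer is mapped pinned memory on a platform with unified virtual addressing, the block has more than 32 threads, and the names CHUNK, WARP, pipelined_sum and run_pipelined_demo are invented for the example. Warp 0 fetches the next chunk into one of two shared buffers while the remaining warps sum the current one, and a single __syncthreads per iteration hands the buffers over.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the demo on any CUDA error (a sketch: no cleanup on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

constexpr int CHUNK = 256;  // floats per chunk (illustrative size)
constexpr int WARP  = 32;   // producer warp size

// Called by warp 0 only: lane i copies elements i, i + 32, ...
__device__ void fetch_chunk(float* dst, const float* src)
{
    for (int i = threadIdx.x; i < CHUNK; i += WARP) dst[i] = src[i];
}

__global__ void pipelined_sum(const float* mapped, int num_chunks, float* total)
{
    __shared__ float buf[2][CHUNK];          // double buffer
    const int  tid       = threadIdx.x;
    const int  consumers = blockDim.x - WARP;
    const bool producer  = tid < WARP;
    float local = 0.0f;
    int cur = 0;
    int c = blockIdx.x;

    // Prologue: fetch this block's first chunk.
    if (producer && c < num_chunks)
        fetch_chunk(buf[0], mapped + (size_t)c * CHUNK);
    __syncthreads();

    for (; c < num_chunks; c += gridDim.x) {
        const int next = c + gridDim.x;
        if (producer) {
            // Warp 0 fetches the next chunk into the idle buffer...
            if (next < num_chunks)
                fetch_chunk(buf[cur ^ 1], mapped + (size_t)next * CHUNK);
        } else {
            // ...while the other warps compute on the current buffer.
            for (int i = tid - WARP; i < CHUNK; i += consumers)
                local += buf[cur][i];
        }
        __syncthreads();  // next chunk is loaded and current one fully consumed
        cur ^= 1;
    }
    if (!producer) atomicAdd(total, local);
}

bool run_pipelined_demo()
{
    const int num_chunks = 64;
    const size_t n = (size_t)num_chunks * CHUNK;
    float *h_data = nullptr, *d_view = nullptr, *d_total = nullptr;

    CHECK(cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped));
    float expected = 0.0f;  // small integers, so float sums stay exact
    for (size_t i = 0; i < n; ++i) {
        h_data[i] = (float)(i % 7);
        expected += h_data[i];
    }
    CHECK(cudaHostGetDevicePointer((void**)&d_view, h_data, 0));
    CHECK(cudaMalloc((void**)&d_total, sizeof(float)));
    CHECK(cudaMemset(d_total, 0, sizeof(float)));

    pipelined_sum<<<8, 128>>>(d_view, num_chunks, d_total);
    CHECK(cudaGetLastError());

    float total = 0.0f;
    CHECK(cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost));

    cudaFree(d_total);
    cudaFreeHost(h_data);
    return total == expected;
}

int main()
{
    std::printf("pipelined demo %s\n", run_pipelined_demo() ? "passed" : "failed");
    return 0;
}
```

Whether this actually hides the transfer depends on the work-to-transfer ratio mentioned above. If the compute per chunk is much shorter than the fetch, the consumer warps will mostly wait at the barrier.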