As far as I know, when a kernel is launched with many blocks, each block is mapped to an SM for execution. Since SMs execute independently and asynchronously, blocks also execute asynchronously. Given that, is it possible to overlap data transfer with compute by implementing the following kernel:
while (there are data chunks not yet consumed by the GPU)
{
    1. In each block, the thread with threadIdx 0 fetches a chunk of data
       from pinned host memory (using a memcpy inside the kernel)
    2. use __syncthreads() to sync all threads of the block
    3. all threads of the block then compute over the chunk
}
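To make the idea concrete, here is a minimal sketch of the pattern I have in mind. It assumes the host buffer was allocated as mapped pinned memory (cudaHostAlloc with cudaHostAllocMapped) and that the kernel receives the corresponding device pointer from cudaHostGetDevicePointer; the names CHUNK_SIZE, streamed_consume, and the doubling "compute" are just placeholders:

```cuda
#define CHUNK_SIZE 256  // illustrative chunk size, one element per thread

// Hypothetical kernel: host_data points at mapped pinned host memory,
// out is an ordinary device buffer.
__global__ void streamed_consume(const float *host_data, float *out,
                                 int num_chunks)
{
    __shared__ float chunk[CHUNK_SIZE];

    // Each block walks over its share of the chunks (grid-stride loop).
    for (int c = blockIdx.x; c < num_chunks; c += gridDim.x) {
        // 1. Thread 0 stages one chunk from pinned host memory
        //    into shared memory.
        if (threadIdx.x == 0) {
            for (int i = 0; i < CHUNK_SIZE; ++i)
                chunk[i] = host_data[c * CHUNK_SIZE + i];
        }

        // 2. Wait until the chunk is fully staged.
        __syncthreads();

        // 3. All threads of the block compute over the chunk
        //    (placeholder computation: double each element).
        if (threadIdx.x < CHUNK_SIZE)
            out[c * CHUNK_SIZE + threadIdx.x] = chunk[threadIdx.x] * 2.0f;

        // Ensure everyone is done reading before the next iteration
        // overwrites the shared buffer.
        __syncthreads();
    }
}
```

(I'm not sure whether a single thread serially copying the chunk is the right approach, or whether the access to pinned host memory is coalesced this way; it's just the most literal translation of the steps above.)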
I'm new to CUDA and GPU programming, so please correct me if I've got something wrong. :-)