Why doesn't overlapping data transfers and kernel execution work here?

Okay, let’s do another trick on top to get parallelism:

Instead of two streams, just use one stream. Instead of 3 kernels, use 1 kernel and assign each thread its function according to threadIdx.x.

H2D() still copies the next block, and D2H() copies back the previous block.

int numBlocks = 20;
int blockSize = 1024;

(alternatively 40 blocks with 768 threads each)

__device__ void copy()
{
    // Threads 768..895 stage the next chunk (H2D), threads 896..1023 copy back
    // the previous chunk (D2H); each copy group is 128 threads wide.
    if (threadIdx.x < 896)
        H2D(128, threadIdx.x - 768);
    else
        D2H(128, threadIdx.x - 896);
}

__global__ void allInOneKernel()
{
    // Threads 0..767 do the computation, the remaining 256 threads do the copies.
    if (threadIdx.x < 768)
        compute(768, threadIdx.x);
    else
        copy();
}

Just be careful that threadIdx.x and blockIdx.x have to be interpreted slightly differently for each group; I tried to indicate this in the parameters of the device functions.
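
To make the index bookkeeping concrete, here is a minimal sketch (function and parameter names are only illustrative) of how each group derives its own logical global index from the partitioned threadIdx.x:

// Sketch only: each device function receives its group size and its local index
// within the group, so the usual global-index formula still applies per group.
__device__ void computeExample(int groupSize, int localIdx)
{
    // For the compute group: groupSize = 768, localIdx runs 0..767 (not 0..1023).
    int globalIdx = blockIdx.x * groupSize + localIdx;   // index into the data
    int stride    = gridDim.x  * groupSize;              // grid-stride step
    // for (int i = globalIdx; i < n; i += stride) { ... process element i ... }
    (void)globalIdx; (void)stride;
}
// The same formula holds for the two 128-thread copy groups (H2D and D2H).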

PS: The reason the kernels do not overlap is probably the way Windows internally pipelines the work it forwards to the GPU, which sometimes leads to delays. Perhaps it could be solved by changing some Windows settings, by using CUDA graphs, or by inserting a delay. It is only the first kernel invocation, after all, so perhaps you can also just accept it.
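
Regarding the CUDA graphs idea, here is a minimal sketch of capturing the stream work once and replaying it (whether this actually removes the first-launch delay on Windows is untested; the kernel launch and its parameters are placeholders):

#include <cuda_runtime.h>

void runViaGraph(cudaStream_t stream)
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Record the usual stream work into a graph; nothing executes during capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // allInOneKernel<<<numBlocks, blockSize, 0, stream>>>(/* args */);
    cudaStreamEndCapture(stream, &graph);

    // CUDA 12 signature; older toolkits use (&graphExec, graph, nullptr, nullptr, 0).
    cudaGraphInstantiate(&graphExec, graph, 0);

    // Replay the captured work with much lower per-launch CPU overhead.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}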


That looks neat! Thanks for the hint, I will try that.

Yeah, that crossed my mind, too. Apart from the “initial” delay, the rest of the profiling session looks pretty good. In the next step I will implement a better time measuring mechanism I saw in some Nvidia example so I can better compare my overlapping Linux version with the custom-copy-kernel version under Windows. If the custom version comes anywhere close to the original overlapping version, I suppose I’ll accept it.

I conducted a series of performance tests comparing various configurations of my copy kernel, which I would like to share.

General setup:

  • no. of test points: 50,000,000 (381.47 MiB)
  • chunksize: 5,000,000 (38.15 MiB)
  • streams: 1

For each grid configuration, I launched 2 test runs, following each other as closely as Nsight Systems would allow. Each run contained 10 launches of cpyToDeviceKernel, each copying 1 chunk of data. The durations given below (min/max/avg) each refer to 1 transferred chunk of 5,000,000 PointXY objects.
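
For context, the copy kernel under test follows roughly this pattern (a sketch under assumptions: the actual cpyToDeviceKernel signature and the PointXY layout may differ):

struct PointXY { float x, y; };   // assumed layout

// Device-side H2D copy: src must be a device-accessible host pointer (pinned
// memory from cudaMallocHost, or a mapped pointer from cudaHostGetDevicePointer);
// dst is a regular device buffer.
__global__ void cpyToDeviceKernel(const PointXY* __restrict__ src,
                                  PointXY* __restrict__ dst,
                                  size_t count)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread copies every stride-th point of the chunk.
    for (size_t i = idx; i < count; i += stride)
        dst[i] = src[i];
}

// One 5,000,000-point chunk with e.g. 40 blocks x 256 threads:
//   cpyToDeviceKernel<<<40, 256, 0, stream>>>(hostChunk, devChunk, 5000000);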

Tests utilizing mapped memory:

numBlocks = 40

40 x 1024 threads, run 1:

  • min: 12.753 ms
  • max: 13.738 ms
  • avg: 13.285 ms

40 x 1024 threads, run 2:

  • min: 12.571 ms
  • max: 13.977 ms
  • avg: 13.209 ms

40 x 768 threads, run 1:

  • min: 12.72 ms
  • max: 18.798 ms
  • avg: 13.96 ms

40 x 768 threads, run 2:

  • min: 12.558 ms
  • max: 14.173 ms
  • avg: 13.165 ms

40 x 512 threads, run 1:

  • min: 12.005 ms
  • max: 18.813 ms
  • avg: 16.538 ms

40 x 512 threads, run 2:

  • min: 13.684 ms
  • max: 19.375 ms
  • avg: 17.213 ms

40 x 256 threads, run 1:

  • min: 11.006 ms
  • max: 18.914 ms
  • avg: 16.664 ms

40 x 256 threads, run 2:

  • min: 12.25 ms
  • max: 18.7 ms
  • avg: 16.147 ms

numBlocks = 60

60 x 1024 threads, run 1:

  • min: 12.162 ms
  • max: 18.889 ms
  • avg: 16.354 ms

60 x 1024 threads, run 2:

  • min: 17.766 ms
  • max: 20.974 ms
  • avg: 18.638 ms

60 x 768 threads, run 1:

  • min: 16.505 ms
  • max: 20.962 ms
  • avg: 17.678 ms

60 x 768 threads, run 2:

  • min: 17.074 ms
  • max: 20.137 ms
  • avg: 17.94 ms

60 x 512 threads, run 1:

  • min: 16.7 ms
  • max: 20.031 ms
  • avg: 17.854 ms

60 x 512 threads, run 2:

  • min: 17.144 ms
  • max: 21.131 ms
  • avg: 18.632 ms

60 x 256 threads, run 1:

  • min: 17.184 ms
  • max: 19.288 ms
  • avg: 17.875 ms

60 x 256 threads, run 2:

  • min: 14.125 ms
  • max: 19.766 ms
  • avg: 17.982 ms

Tests utilizing pinned memory:

numBlocks = 40

40 x 1024 threads, run 1:

  • min: 8.172 ms
  • max: 9.339 ms
  • avg: 8.523 ms

40 x 1024 threads, run 2:

  • min: 8.208 ms
  • max: 9.193 ms
  • avg: 8.501 ms

40 x 768 threads, run 1:

  • min: 8.194 ms
  • max: 9.233 ms
  • avg: 8.42 ms

40 x 768 threads, run 2:

  • min: 8.169 ms
  • max: 8.739 ms
  • avg: 8.398 ms

40 x 512 threads, run 1:

  • min: 7.946 ms
  • max: 9.618 ms
  • avg: 9.005 ms

40 x 512 threads, run 2:

  • min: 7.935 ms
  • max: 10.512 ms
  • avg: 9.951 ms

40 x 256 threads, run 1:

  • min: 5.013 ms
  • max: 7.819 ms
  • avg: 5.939 ms

40 x 256 threads, run 2:

  • min: 4.344 ms
  • max: 7.478 ms
  • avg: 5.853 ms

numBlocks = 60

60 x 1024 threads, run 1:

  • min: 6.439 ms
  • max: 9.17 ms
  • avg: 7.857 ms

60 x 1024 threads, run 2:

  • min: 6.679 ms
  • max: 11.305 ms
  • avg: 8.461 ms

60 x 768 threads, run 1:

  • min: 6.403 ms
  • max: 11.724 ms
  • avg: 8.698 ms

60 x 768 threads, run 2:

  • min: 7.755 ms
  • max: 9.338 ms
  • avg: 8.822 ms

60 x 512 threads, run 1:

  • min: 6.521 ms
  • max: 9.287 ms
  • avg: 8.096 ms

60 x 512 threads, run 2:

  • min: 4.909 ms
  • max: 9.07 ms
  • avg: 6.963 ms

60 x 256 threads, run 1:

  • min: 4.084 ms
  • max: 8.099 ms
  • avg: 6.459 ms

60 x 256 threads, run 2:

  • min: 6.143 ms
  • max: 8.376 ms
  • avg: 7.622 ms

For reference, cudaMemcpyAsync() with the same chunksize:

cudaMemcpyAsync()

run 1:

  • min: 5.306 ms
  • max: 6.142 ms
  • avg: 5.492 ms

run 2:

  • min: 5.305 ms
  • max: 6.047 ms
  • avg: 5.667 ms
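
For completeness, the baseline above is just the standard asynchronous copy of one chunk on the stream (a sketch; buffer and stream names are placeholders, PointXY as in the earlier sketch, and hostChunk must be pinned for the copy to actually run asynchronously):

#include <cuda_runtime.h>

void copyChunkBaseline(PointXY* devChunk, const PointXY* hostChunk,
                       size_t count, cudaStream_t stream)
{
    cudaMemcpyAsync(devChunk, hostChunk, count * sizeof(PointXY),
                    cudaMemcpyHostToDevice, stream);
}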

Conclusion:

  • Obviously, pinned memory holds a significant advantage over mapped memory in terms of copy speed. We had already gathered that.
  • The results for the same configuration sometimes differ a lot from run 1 to run 2. That is why I did 2 runs per configuration in the first place. Still, it surprises me how much the best-case and worst-case results in particular can diverge.
  • Looking at the best-case and average-case results, the smaller block sizes (512 and 256 threads) seem to give the best results. I will see whether I can get close to these when using a more production-like configuration with 2 or 3 streams.
  • The best results using cpyToDeviceKernel come pretty close to what I got from cudaMemcpyAsync, which I find promising.

Wow, nice overview. And very good results: you matched the cudaMemcpyAsync speed!

I'm not sure how much of the min/max outliers comes from the measurement method.
You can measure times on the GPU with the global timer, or in the stream with a host callback.
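
Two minimal sketches of those options (names are placeholders; the global timer is read via inline PTX):

#include <chrono>
#include <cuda_runtime.h>

// (a) On the GPU: read the nanosecond-resolution global timer inside the kernel.
__device__ unsigned long long globalTimerNs()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void timedKernel(unsigned long long* elapsedNs)
{
    unsigned long long start = globalTimerNs();
    // ... the actual work ...
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *elapsedNs = globalTimerNs() - start;   // rough: one thread's own duration
}

// (b) In the stream: a host callback that fires once all prior work in the
//     stream has finished, recording a CPU timestamp.
void CUDART_CB recordTimestamp(void* userData)
{
    auto* tp = static_cast<std::chrono::steady_clock::time_point*>(userData);
    *tp = std::chrono::steady_clock::now();
}

// Usage sketch:
//   cudaLaunchHostFunc(stream, recordTimestamp, &t0);
//   cpyToDeviceKernel<<<blocks, threads, 0, stream>>>(src, dst, count);
//   cudaLaunchHostFunc(stream, recordTimestamp, &t1);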

Perhaps you want to add test runs with even fewer threads than 256?

You will need the threads as soon as you do parallel computation, although 256 is already low enough.

I'm currently using std::chrono::high_resolution_clock because I want to be able to measure CPU and GPU calculations with the same method. Ultimately, I want to compare the time the CPU and the GPU need to perform the same task.
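
A sketch of that kind of host-side timing (runOnCpu/runOnGpu are placeholders; the important detail is synchronizing the stream before stopping the clock on the GPU side):

#include <chrono>
#include <cuda_runtime.h>

// Time any task with the same clock, whether it runs on the CPU or the GPU.
template <typename F>
double elapsedMs(F&& task)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    task();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage sketch:
//   double cpuMs = elapsedMs([&] { runOnCpu(points); });
//   double gpuMs = elapsedMs([&] {
//       runOnGpu(points, stream);          // H2D, compute, D2H on the stream
//       cudaStreamSynchronize(stream);     // otherwise we only time the launches
//   });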

Sure, I can try that.
By the way, I’m just testing with 2/3 streams and different grid configurations again. You said blockSize should always be a multiple of 128. Where did you get this constraint? Wouldn’t using a multiple of the warpSize (=32) be possible?

Sounds good. So this is actual runtime variation (probably still due to Windows), not some Nsight Systems artifact.

Yes. But I fear that kernels in other streams will not be able to use the freed threads (I am not 100% sure); you could try it out.
So even if the copy kernel is smaller, perhaps the computation kernel cannot be made larger and still run at the same time.
There are some hints that the reservation of resources is not independent between the partitions.

We have 4 SM Partitions per SM. Apart from whether it is possible, should we fill the partitions evenly?

  • For compute, one tries to divide the work evenly over the partitions, as each has its own computational units.

  • For (slow) memory accesses, it probably would not matter whether they are distributed evenly?
    However, it could be that the number of transactions in flight (needed to cover the long latency) fills up pipelines within the partitions.

I have to correct myself here: I am indeed using the high_resolution_clock, but only for measuring the total time needed for a complete run of H2D, computation, and D2H. That is what I'm finally going to compare amongst different CPUs and GPUs. The measurements of the performance tests above are actually taken from Nsight Systems. I got that mixed up, sorry for the confusion.

Hm, okay. Maybe I will try this out. It would give me more flexibility in finding the optimal grid configuration while preventing the copy operations from completely occupying the GPU. On the other hand, that makes for loads of additional test cases :-D