Can't get any concurrency with pattern H2D->K->D2H on several streams

Thanks for your response.

I agree with you about Windows being a troublemaker sometimes. But just telling my teammates we should switch to Linux won't make them buy me a drink. I have tried to set these RTX 3080s to TCC mode without success, and purchasing new cards is not an option at the moment.

Back to my question: I think this time it's not WDDM making the trouble, but rather how the GPU driver treats the RTX 3080. I have made some new discoveries since the original post.

If we change the order of the async calls in the function ‘fire’, from

    cudaMemcpyAsync(e.dev.i, e.hos.i, InpSize, cudaMemcpyHostToDevice, s);
    _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e.dev.i, e.dev.o);
    cudaMemcpyAsync(e.hos.o, e.dev.o, OutSize, cudaMemcpyDeviceToHost, s);
    cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(i));

to

    _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e.dev.i, e.dev.o);
    cudaMemcpyAsync(e.hos.o, e.dev.o, OutSize, cudaMemcpyDeviceToHost, s);
    cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(i));
    cudaMemcpyAsync(e.dev.i, e.hos.i, InpSize, cudaMemcpyHostToDevice, s);

the execution overlap occurs again.
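
Since the snippets above only show the body of ‘fire’, here is a minimal, self-contained sketch of how I would expect the surrounding setup to look (pinned host buffers, one stream per engine, a dummy kernel). The names InpSize/OutSize/BlockCount/ThreadCount/SmemSize and the call sequence mirror the snippet above; the buffer sizes, kernel body, NumStreams and Iterations are my own assumptions for illustration, not the original code:

    #include <cuda_runtime.h>
    #include <cstdio>

    constexpr int    BlockCount  = 64;
    constexpr int    ThreadCount = 256;
    constexpr size_t InpSize     = 1 << 20;   // 1 MiB per transfer (assumed)
    constexpr size_t OutSize     = 1 << 20;
    constexpr size_t SmemSize    = 0;
    constexpr int    NumStreams  = 2;
    constexpr int    Iterations  = 16;

    // Dummy kernel, just long enough to make overlap visible in a profiler.
    template<int B, int T>
    __global__ void _test(const unsigned char *in, unsigned char *out)
    {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < InpSize; i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i] + 1;
    }

    // Host callback; no CUDA API calls are allowed inside it.
    void CUDART_CB _callback(void *userData)
    {
        printf("stream %zu finished one iteration\n",
               reinterpret_cast<size_t>(userData));
    }

    struct Engine {
        struct { unsigned char *i, *o; } dev;
        struct { unsigned char *i, *o; } hos;
    };

    int main()
    {
        Engine       e[NumStreams];
        cudaStream_t streams[NumStreams];

        for (int i = 0; i < NumStreams; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&e[i].dev.i, InpSize);
            cudaMalloc(&e[i].dev.o, OutSize);
            // Pinned host memory is required for cudaMemcpyAsync to be truly async.
            cudaHostAlloc(&e[i].hos.i, InpSize, cudaHostAllocDefault);
            cudaHostAlloc(&e[i].hos.o, OutSize, cudaHostAllocDefault);
        }

        for (int it = 0; it < Iterations; ++it) {
            for (int i = 0; i < NumStreams; ++i) {
                cudaStream_t s = streams[i];
                // The pattern under discussion: H2D -> K -> D2H, issued per stream.
                cudaMemcpyAsync(e[i].dev.i, e[i].hos.i, InpSize, cudaMemcpyHostToDevice, s);
                _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e[i].dev.i, e[i].dev.o);
                cudaMemcpyAsync(e[i].hos.o, e[i].dev.o, OutSize, cudaMemcpyDeviceToHost, s);
                cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(static_cast<size_t>(i)));
            }
        }
        cudaDeviceSynchronize();
        return 0;
    }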


My thoughts: the number of copy engines matters, or rather, the number of hardware queues being used matters.

Assume we do H2D->K->D2H on some streams.

S1:  (H2D-1 KKK-1 D2H-1) (H2D-1 KKK-1 D2H-1) ...
S2:  (H2D-2 KKK-2 D2H-2) (H2D-2 KKK-2 D2H-2) ...

The RTX 3090 employs all 3 queues, one for kernels, one for D2H, one for H2D; items in these queues are arranged like this:

Q1:  H2D-1 H2D-2 H2D-1 H2D-2 H2D-1 ....
Q2:  KKK-1 KKK-2 KKK-1 KKK-2 KKK-1 ....
Q3:  D2H-1 D2H-2 D2H-1 D2H-2 D2H-1 .....

Each stream x starts with H2D-x, and Q1 invokes them one by one. The streams do not affect each other, because an H2D only sits behind another H2D in Q1; they never wait for other kinds of operations.

The RTX 3080 with my configuration employs only 2 queues, one for kernels and one for both D2H and H2D; items are arranged like this:

Q1:  H2D-1         D2H-1 H2D-2        D2H-2 H2D-1        D2H-1 .....
Q2:         KKK-1               KKK-2              KKK-1 ...

Thus H2D-2 waits for D2H-1, and D2H-1 waits for KKK-1 to complete. So stream 2 won't start its first/next H2D->K->D2H until an H2D->K->D2H in stream 1 has completely ended; everything is serialized.

When we change the order of operations in the streams:

S1:  (KKK-1 D2H-1 H2D-1) (KKK-1 D2H-1 H2D-1) ...
S2:  (KKK-2 D2H-2 H2D-2) (KKK-2 D2H-2 H2D-2) ...

in queues:

Q1:          D2H-1 H2D-1           D2H-2 H2D-2           D2H-1 H2D-1 .....
Q2:  KKK-1                 KKK-2                 KKK-1 ...

In this order, KKK-1 still needs the previous H2D-1, but this time the necessary H2D-1 has already completed right after the previous D2H-1, without any blocking.
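
A rough way to check the serialization without a profiler (this is only a sketch of how one could verify it, reusing the assumed fire(i) wrapper, Iterations and NumStreams from the sketch above; it is not what I actually ran): time the whole batch with CUDA events and compare the wall time with one stream versus two. If the streams really overlap, two streams should take roughly the same time as one; if they are serialized as drawn above, two streams take roughly twice as long.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // recorded in the default stream
    for (int it = 0; it < Iterations; ++it)
        for (int i = 0; i < NumStreams; ++i)
            fire(i);                         // issues H2D -> K -> D2H on streams[i]
    cudaEventRecord(stop);                   // default stream waits for all blocking streams
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d streams, %d iterations: %.2f ms\n", NumStreams, Iterations, ms);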

This blocked/serialized behavior is not right, and the RTX 3080 shouldn't have only 1 copy engine. Actually, after finding this, I did some searching. In this post, you lean toward this being a bug. And the poster says that on Linux the copy engines are recognized correctly, and then everything works fine.

And this post describes the problem better than I do, though it is different hardware:

What he called “bi-directional data transfers overlapping with kernel execution” is important, and cudaDeviceProp.asyncEngineCount is important, especially for the pattern H2D->K->D2H, but people rarely talk about that.
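
For reference, this is how the copy engine count can be queried (a minimal sketch, assuming device 0; the meaning of the values is taken from the CUDA runtime documentation):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // asyncEngineCount: 1 means a copy in one direction can overlap with kernel
        // execution; 2 means copies in both directions can overlap with kernel
        // execution at the same time.
        printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
        return 0;
    }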

Though the manual operation-reordering trick works, not all jobs can be done this way (the sample above is just the simplest demo), and different jobs have different levels of difficulty depending on their conditions/situations/business logic. So I think there are two things to do:

  1. Clarify this in elementary courses, not simply “H2D->K->D2H works well”; users should learn more about the details of copy engines/queues. At least a side note is necessary.

  2. Fix the bug on some cards. With hardware scheduling enabled, the RTX 3080 is reported with 1 copy engine. With hardware scheduling disabled, the RTX 3080 is reported with 5 copy engines, but it still can't achieve the overlap. Windows/WDDM has nothing to do with this; my guess is that the GPU driver confuses itself.

Typing so much is not just about blaming NVIDIA. There are few materials about this topic online, so I made the effort to express my views/thoughts. And, more IMPORTANT, I'm not sure that what I said in this post is right. Please correct my mistakes and give better answers. Any hints are appreciated. Let us make it clear, once and for all.

Sorry for my poor English. And to @Robert_Crovella: I have read many of your posts/replies over the past years, and you have helped me a lot even though you don't know me. Taking this opportunity, I must say thank you very much!
