Can't get any concurrency with pattern H2D->K->D2H on several streams

Thanks for your response.

I agree with you about Windows being a troublemaker sometimes. But just telling my teammates we should switch to Linux won't make them buy me a drink. I have tried to set these RTX 3080s to TCC mode without success, and purchasing new cards is not an option at the moment.

Back to my question: I think this time it's not WDDM making the trouble, but rather how the GPU driver treats the RTX 3080. I have made some new discoveries since the original post.

If we change the order of the async calls in the function ‘fire’, from

    cudaMemcpyAsync(e.dev.i, e.hos.i, InpSize, cudaMemcpyHostToDevice, s);
    _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e.dev.i, e.dev.o);
    cudaMemcpyAsync(e.hos.o, e.dev.o, OutSize, cudaMemcpyDeviceToHost, s);
    cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(i));

to

    _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e.dev.i, e.dev.o);
    cudaMemcpyAsync(e.hos.o, e.dev.o, OutSize, cudaMemcpyDeviceToHost, s);
    cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(i));
    cudaMemcpyAsync(e.dev.i, e.hos.i, InpSize, cudaMemcpyHostToDevice, s);

the execution overlap occurs again.
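
Since the snippets above only show the body of ‘fire’, here is a minimal, self-contained sketch of how I would expect the surrounding setup to look (pinned host buffers, one stream per engine, a dummy kernel). The names InpSize/OutSize/BlockCount/ThreadCount/SmemSize and the call sequence mirror the snippet above; the buffer sizes, kernel body, NumStreams and Iterations are my own assumptions for illustration, not the original code:

    #include <cuda_runtime.h>
    #include <cstdio>

    constexpr int    BlockCount  = 64;
    constexpr int    ThreadCount = 256;
    constexpr size_t InpSize     = 1 << 20;   // 1 MiB per transfer (assumed)
    constexpr size_t OutSize     = 1 << 20;
    constexpr size_t SmemSize    = 0;
    constexpr int    NumStreams  = 2;
    constexpr int    Iterations  = 16;

    // Dummy kernel, just long enough to make overlap visible in a profiler.
    template<int B, int T>
    __global__ void _test(const unsigned char *in, unsigned char *out)
    {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < InpSize; i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i] + 1;
    }

    // Host callback; no CUDA API calls are allowed inside it.
    void CUDART_CB _callback(void *userData)
    {
        printf("stream %zu finished one iteration\n",
               reinterpret_cast<size_t>(userData));
    }

    struct Engine {
        struct { unsigned char *i, *o; } dev;
        struct { unsigned char *i, *o; } hos;
    };

    int main()
    {
        Engine       e[NumStreams];
        cudaStream_t streams[NumStreams];

        for (int i = 0; i < NumStreams; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&e[i].dev.i, InpSize);
            cudaMalloc(&e[i].dev.o, OutSize);
            // Pinned host memory is required for cudaMemcpyAsync to be truly async.
            cudaHostAlloc(&e[i].hos.i, InpSize, cudaHostAllocDefault);
            cudaHostAlloc(&e[i].hos.o, OutSize, cudaHostAllocDefault);
        }

        for (int it = 0; it < Iterations; ++it) {
            for (int i = 0; i < NumStreams; ++i) {
                cudaStream_t s = streams[i];
                // The pattern under discussion: H2D -> K -> D2H, issued per stream.
                cudaMemcpyAsync(e[i].dev.i, e[i].hos.i, InpSize, cudaMemcpyHostToDevice, s);
                _test<BlockCount, ThreadCount><<<BlockCount, ThreadCount, SmemSize, s>>>(e[i].dev.i, e[i].dev.o);
                cudaMemcpyAsync(e[i].hos.o, e[i].dev.o, OutSize, cudaMemcpyDeviceToHost, s);
                cudaLaunchHostFunc(s, _callback, reinterpret_cast<void *>(static_cast<size_t>(i)));
            }
        }
        cudaDeviceSynchronize();
        return 0;
    }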


My thoughts: the number of copy engines matters, or rather, the number of hardware queues being used matters.

Assume we do H2D->K->D2H on some streams.

S1:  (H2D-1 KKK-1 D2H-1) (H2D-1 KKK-1 D2H-1) ...
S2:  (H2D-2 KKK-2 D2H-2) (H2D-2 KKK-2 D2H-2) ...

The RTX 3090 employs all 3 queues, one for kernels, one for D2H, one for H2D; items in these queues are arranged like this:

Q1:  H2D-1 H2D-2 H2D-1 H2D-2 H2D-1 ....
Q2:  KKK-1 KKK-2 KKK-1 KKK-2 KKK-1 ....
Q3:  D2H-1 D2H-2 D2H-1 D2H-2 D2H-1 .....

Each stream x starts with H2D-x, and Q1 invokes them one by one. The streams do not affect each other, because an H2D only sits behind another H2D in Q1; they never wait for other kinds of operations.

The RTX 3080 with my configuration employs only 2 queues, one for kernels and one for both D2H and H2D; items are arranged like this:

Q1:  H2D-1         D2H-1 H2D-2        D2H-2 H2D-1        D2H-1 .....
Q2:         KKK-1               KKK-2              KKK-1 ...

Thus H2D-2 waits for D2H-1, and D2H-1 waits for KKK-1 to complete. So stream 2 won't start its first/next H2D->K->D2H until an H2D->K->D2H in stream 1 has completely ended; everything is serialized.

When we change the order of operations in the streams:

S1:  (KKK-1 D2H-1 H2D-1) (KKK-1 D2H-1 H2D-1) ...
S2:  (KKK-2 D2H-2 H2D-2) (KKK-2 D2H-2 H2D-2) ...

in queues:

Q1:          D2H-1 H2D-1           D2H-2 H2D-2           D2H-1 H2D-1 .....
Q2:  KKK-1                 KKK-2                 KKK-1 ...

In this order, KKK-1 still needs the previous H2D-1, but this time the necessary H2D-1 has already completed right after the previous D2H-1, without any blocking.
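
A rough way to check the serialization without a profiler (this is only a sketch of how one could verify it, reusing the assumed fire(i) wrapper, Iterations and NumStreams from the sketch above; it is not what I actually ran): time the whole batch with CUDA events and compare the wall time with one stream versus two. If the streams really overlap, two streams should take roughly the same time as one; if they are serialized as drawn above, two streams take roughly twice as long.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // recorded in the default stream
    for (int it = 0; it < Iterations; ++it)
        for (int i = 0; i < NumStreams; ++i)
            fire(i);                         // issues H2D -> K -> D2H on streams[i]
    cudaEventRecord(stop);                   // default stream waits for all blocking streams
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d streams, %d iterations: %.2f ms\n", NumStreams, Iterations, ms);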

This blocked/serialized behavior is not right, and the RTX 3080 shouldn't have only 1 copy engine. Actually, after finding this, I did some searching. In this post, you lean toward this being a bug. And the poster says that on Linux the copy engines are recognized correctly, and then everything works fine.

And this post describes the problem better than I do, though it is different hardware:

What he called “bi-directional data transfers overlapping with kernel execution” is important, and cudaDeviceProp.asyncEngineCount is important, especially for the pattern H2D->K->D2H, but people rarely talk about that.
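
For reference, this is how the copy engine count can be queried (a minimal sketch, assuming device 0; the meaning of the values is taken from the CUDA runtime documentation):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // asyncEngineCount: 1 means a copy in one direction can overlap with kernel
        // execution; 2 means copies in both directions can overlap with kernel
        // execution at the same time.
        printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
        return 0;
    }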

Though the manual operation-reordering trick works, not all jobs can be done this way (the sample above is just the simplest demo), and different jobs have different levels of difficulty depending on their conditions/situations/business logic. So I think there are two things to do:

  1. Clarify this in elementary courses, not simply “H2D->K->D2H works well”; users should learn more about the details of copy engines/queues. At least a side note is necessary.

  2. Fix the bug on some cards. With hardware scheduling enabled, the RTX 3080 is reported with 1 copy engine. With hardware scheduling disabled, the RTX 3080 is reported with 5 copy engines, but it still can't achieve the overlap. Windows/WDDM has nothing to do with this; my guess is that the GPU driver confuses itself.

Typing so much is not just about blaming NVIDIA. There are few materials about this topic online, so I made the effort to express my views/thoughts. And, more IMPORTANT, I'm not sure that what I said in this post is right. Please correct my mistakes and give better answers. Any hints are appreciated. Let us make it clear, once and for all.

Sorry for my poor English. And to @Robert_Crovella: I have read many of your posts/replies over the past years, and you have helped me a lot even though you don't know me. Taking this opportunity, I must say thank you very much!
