Does cudaLaunchHostFunc block work added to all streams?

aous72 · March 7, 2021, 12:59am

In the following program, I have created two non-blocking streams, stream0 and stream1.
I noticed that enqueuing cudaLaunchHostFunc in stream0 will block newly added work in the other stream (i.e., stream1). This can be checked from the timeline produced by the Visual Profiler.

Making the cudaLaunchHostFunc call after stall_kern for stream1, allows both streams to run concurrently.

Is this the expected behavior, or am I missing something?
This behavior is not very useful; one would expect that streams operate independently unless there is something that ties them together such as an event.

Thank you.

Screenshot 2021-03-07 115839

#include "cuda_runtime.h"
#include <cstdio>

__global__ void stall_kern(int time_diff)
{
  const long long int u = clock64() + time_diff;
  long long int t;
	do {
		t = clock64();
	} while (t < u);
}

void CUDART_CB callback_fun(void *userData)
{
  //nothing, just something to call.
}

int main()
{
  cudaStream_t stream0, stream1;
  cudaStreamCreateWithFlags(&stream0, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);

  stall_kern<<<1,1,0,stream0>>>(20000000);
  cudaLaunchHostFunc(stream0, callback_fun, NULL);
  stall_kern<<<1,1,0,stream1>>>(20000000);

  cudaError_t cudaStatus = cudaDeviceReset();
  return 0;
}

striker159 · March 7, 2021, 10:32am

With cuda 11.2, the nsight systems profiler shows both kernels running concurrently on my machine.
Edit: I am using linux

aous72 · March 7, 2021, 10:43am

Thank you striker.

With CUDA11.2, the Nsight Systems Profiler is shown they are run one after the other. What could the reason be?

It is a GTX 1660 Ti compiled for compute_75/sm_75

“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\nvcc.exe” -gencode=arch=compute_75,code="sm_75,compute_75" --use-local-env -ccbin “C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64” -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc141.pdb /FS /MD " -o x64\Release\simple.cu.cpp.obj “simple.cu.cpp”

striker159 · March 7, 2021, 11:29am

Maybe this is a Windows specific behaviour. I only have access to a Linux machine so I cannot confirm it myself.

aous72 · March 7, 2021, 8:44pm

Thank you striker.

Robert_Crovella · March 8, 2021, 10:38pm

you may be running into wddm command batching. If you make your kernels run long enough and then do a cudaStreamQuery on each stream before the cudaDeviceReset, you may get all the work pushed to the GPU so that it can run concurrently.

aous72 · March 9, 2021, 1:57am

Thank you Robert.

I increased the time 100 folds to ~1 second. Then,
I tried inserting

cudaStreamQuery(stream1),

or creating an event and

cudaEventRecord(event, stream1);
cudaEventQuery(event);

or

cudaStreamSynchronize(stream1);

or a combination of these, before cudaDeviceReset(), but the results is still the same as above.

Thank you.

njuffa · March 9, 2021, 2:10am

I admit I don’t understand the reasoning behind Robert Crovella’s suggestion, but in my experience it is not possible to reliably influence or control the batching mechanism employed by the CUDA driver to work around the significant overhead of the WDDM driver model. Which is why he used the word “may”, I assume.

If this concurrency is absolutely crucial to the use case, I would suggest considering the use of [1] a Linux platform [2] a Windows platform with a TCC-capable GPU.

Robert_Crovella · March 9, 2021, 2:35am

Shooting from the hip. The reasoning is/was that if command batching introduced a “bubble” in the command delivery to the GPU after the first kernel launch but before the 2nd, then the 2nd kernel might not run concurrently with the first. However, in my experience, making the kernel duration longer has a countering effect. In any event, it used to be that such “bubbles” could be flushed in various undocumented or non-standard ways such as cudaStreamQuery. The latest data seems to eliminate this as a likely cause, so the problem may lie elsewhere. You may wish to file a bug.

aous72 · March 9, 2021, 2:40am

Thank you njuffa.

I need to code to run on both Windows and Linux.
This discussion is about finding the most efficient way of running my code.
I think this discussion made it clear to me that I should use event polling, instead of callback.

Thank you.

PS. : this is not directly related to this question.

In the past, I noticed that enqueued instructions might not start executing immediately. Now, I understand this might be due to WDDM batching/overhead. To overcome this delayed execution, I noticed that if I insert cudaEventRecord, then cudaEventQuery, the enqueued instructions would start right away. This supports Robert argument about WDDM. Any delay in execution wastes cycles.
Then I can use EventQuery polling instead of callback.

aous72 · March 9, 2021, 2:43am

Thank you Robert.

I think I will file a bug report.

See the PS section in my reply to njuffa submitted shortly before this.

njuffa · March 9, 2021, 2:45am

My concern with exploiting such observed behavior is that unless this behavior is guaranteed (check the documentation), it makes the solution brittle because without a specification that says so, that behavior could change with any change of the CUDA driver or runtime.

So, simply an implementation artifact or officially defined behavior? I don’t now the answer.

aous72 · March 9, 2021, 2:49am

Thank you njuffa.

I agree, but what can I do if I want the code to run on most GPUs and Windows/Linux. cudaLaunchHostFunc is documented to work in a certain way, and as far as I can tell, it does not.

njuffa · March 9, 2021, 2:57am

As Robert Crovella already said, you would want to report discrepancies between documentation and actual behavior to NVIDIA in the form of a bug report. Of course there are no guarantees in which way the discrepancy will be resolved: either the behavior will be changed to match the current description, or the description will be changed to match the current implementation. I have seen either happen in the past.

In practical terms I’d say: Deliver the best-effort solution for both Windows and Linux (if the Windows version runs a bit slower, c’est la vie), and add a note for the maintainer that the current solution for Windows may need some additional TLC in the future.

Way back when CUDA was still young, I wound up revising some of my code with every major new CUDA version because I had to rely on implementation artifacts for performance reasons. Half a dozen re-writes over as many years.

aous72 · March 9, 2021, 3:00am

I think you summarized the situation well. Thank you.

aous72 · March 9, 2021, 9:33am

The above results were obtained with the latest driver.

Driver version 460.89 seems to produce the correct behaviur.

Robert_Crovella · March 9, 2021, 1:54pm

That would be good information to put into a bug report.

aous72 · March 9, 2021, 8:38pm

Done. Thank you.

aous72 · April 27, 2021, 11:37am

I thought of sharing this outcome with the group.

I filed a bug report and today I got this answer.

"After checking on our local . We think the behavior now is expected . We can see the tasks on other streams are blocked by cudaLaunchHostFunc only when Hardware-accelerated GPU scheduling is off .

A workaround for you is to switch to HW accelerated scheduling, which will remove the serialization: Settings > System > Graphics settings > Hardware-accelerated GPU scheduling On->Reboot

In the future windows releases , HW accelerated scheduling will be default on which will eliminate this issue in a larger extend. For now , we have other implementation in HW accelerated scheduling off mode that changed this behavior to be serializing . But we designed it as expected.
Hope this explains your issue . Please let us know if you still see the issue on HW accelerated scheduling mode . Thank you."

Testing, I can see the expected behavior as suggested in the reply.

Using CUDA 11.3, and 466.11 Driver.
When Hardware-accelerated GPU scheduling is OFF, cudaLaunchHostFunc serializes (blocks work on all streams).
When Hardware-accelerated GPU scheduling is ON, cudaLaunchHostFunc does not block work on other streams (correct behavior).

Cheers.

Topic		Replies	Views
cudaLaunchHostFunc blocking work on Linux CUDA Programming and Performance cuda	2	743	September 22, 2022
cudaLaunchHostFunc API example CUDA Programming and Performance	31	6763	February 8, 2025
culaunchHostFunc overhead latency usage + CPU->GPU signaling CUDA Programming and Performance	6	226	April 1, 2025
cuLaunchHostFunc Questions CUDA Programming and Performance cuda , kernel , nsight	2	676	March 1, 2024
Anyway to tell Cuda run host function launched in different stream in multiple threads(such as a threadpool) CUDA Programming and Performance	2	696	July 11, 2023
cudaLaunchKernel Hangs When Interleaving Multiple Empty Kernels and cudaLaunchHostFunc CUDA Programming and Performance cuda , kernel , a100	2	1917	June 25, 2022
cudaLaunchHostFunc requires cudaStreamSynchronize CUDA Programming and Performance	2	384	January 28, 2024
Launching several kernels on one stream while another kernel running persistently in the background CUDA Programming and Performance	1	757	October 8, 2016
Why does cudaStreamAddCallback serialize kernel execution and break concurrency? CUDA Programming and Performance	12	8265	April 5, 2015
Using multiple streams with multiple host threads takes longer? stream CUDA Programming and Performance	3	962	February 10, 2021

Does cudaLaunchHostFunc block work added to all streams?

Related topics