Does cudaLaunchHostFunc block work added to all streams?

In the following program, I have created two non-blocking streams, stream0 and stream1.
I noticed that enqueuing cudaLaunchHostFunc in stream0 blocks work that is subsequently added to the other stream (i.e., stream1). This can be seen in the timeline produced by the Visual Profiler.

Making the cudaLaunchHostFunc call after the stall_kern launch on stream1 allows both streams to run concurrently, as sketched below.
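
For reference, the reordered launch sequence looks like this (a sketch of the launch order only; stream creation and cleanup are as in the full program below):

stall_kern<<<1,1,0,stream0>>>(20000000);
stall_kern<<<1,1,0,stream1>>>(20000000);          // stream1's kernel is already queued...
cudaLaunchHostFunc(stream0, callback_fun, NULL);  // ...before the host function is enqueued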

  1. Is this the expected behavior, or am I missing something?
  2. This behavior is not very useful; one would expect streams to operate independently unless something ties them together, such as an event.

Thank you.

[Screenshot: Visual Profiler timeline, 2021-03-07]

#include "cuda_runtime.h"
#include <cstdio>

__global__ void stall_kern(int time_diff)
{
  // Busy-wait until clock64() has advanced by time_diff cycles.
  const long long int u = clock64() + time_diff;
  long long int t;
  do {
    t = clock64();
  } while (t < u);
}

void CUDART_CB callback_fun(void *userData)
{
  //nothing, just something to call.
}

int main()
{
  cudaStream_t stream0, stream1;
  cudaStreamCreateWithFlags(&stream0, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);

  stall_kern<<<1,1,0,stream0>>>(20000000);
  cudaLaunchHostFunc(stream0, callback_fun, NULL);
  stall_kern<<<1,1,0,stream1>>>(20000000);

  cudaError_t cudaStatus = cudaDeviceReset();
  return 0;
}

With CUDA 11.2, the Nsight Systems profiler shows both kernels running concurrently on my machine.
Edit: I am using Linux.

Thank you striker.

With CUDA 11.2, the Nsight Systems profiler shows them running one after the other. What could the reason be?

It is a GTX 1660 Ti, and the code is compiled for compute_75/sm_75:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\nvcc.exe" -gencode=arch=compute_75,code="sm_75,compute_75" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc141.pdb /FS /MD " -o x64\Release\simple.cu.cpp.obj "simple.cu.cpp"

Maybe this is a Windows-specific behaviour. I only have access to a Linux machine, so I cannot confirm it myself.

Thank you striker.

You may be running into WDDM command batching. If you make your kernels run long enough and then do a cudaStreamQuery on each stream before the cudaDeviceReset, you may get all the work pushed to the GPU so that it can run concurrently.
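
Something along these lines (a sketch of the idea only; the longer duration value is illustrative):

stall_kern<<<1,1,0,stream0>>>(2000000000);   // much longer busy-wait
cudaLaunchHostFunc(stream0, callback_fun, NULL);
stall_kern<<<1,1,0,stream1>>>(2000000000);

// Querying each stream may flush WDDM's batched commands to the GPU.
cudaStreamQuery(stream0);
cudaStreamQuery(stream1);

cudaDeviceReset();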

Thank you Robert.

I increased the time 100-fold, to ~1 second. Then I tried inserting

cudaStreamQuery(stream1),

or creating an event and

cudaEventRecord(event, stream1);
cudaEventQuery(event);

or

cudaStreamSynchronize(stream1);

or a combination of these, before cudaDeviceReset(), but the result is still the same as above (see the sketch below).
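
For reference, the attempts looked roughly like this (a sketch only; in practice only one variant was active at a time, and error checking is omitted):

stall_kern<<<1,1,0,stream0>>>(2000000000);        // ~1 second busy-wait
cudaLaunchHostFunc(stream0, callback_fun, NULL);
stall_kern<<<1,1,0,stream1>>>(2000000000);

// Variant 1: query the stream.
cudaStreamQuery(stream1);

// Variant 2: record and query an event on the stream.
cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event, stream1);
cudaEventQuery(event);

// Variant 3: synchronize the stream.
cudaStreamSynchronize(stream1);

cudaDeviceReset();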

Thank you.

I admit I don’t understand the reasoning behind Robert Crovella’s suggestion, but in my experience it is not possible to reliably influence or control the batching mechanism employed by the CUDA driver to work around the significant overhead of the WDDM driver model. Which is why he used the word “may”, I assume.

If this concurrency is absolutely crucial to the use case, I would suggest considering the use of (1) a Linux platform or (2) a Windows platform with a TCC-capable GPU.

Shooting from the hip. The reasoning is/was that if command batching introduced a “bubble” in the command delivery to the GPU after the first kernel launch but before the 2nd, then the 2nd kernel might not run concurrently with the first. However, in my experience, making the kernel duration longer has a countering effect. In any event, it used to be that such “bubbles” could be flushed in various undocumented or non-standard ways such as cudaStreamQuery. The latest data seems to eliminate this as a likely cause, so the problem may lie elsewhere. You may wish to file a bug.

Thank you njuffa.

I need the code to run on both Windows and Linux.
This discussion is about finding the most efficient way of running my code.
I think this discussion made it clear to me that I should use event polling instead of a callback.

Thank you.

P.S.: this is not directly related to this question.

In the past, I noticed that enqueued instructions might not start executing immediately. Now I understand this might be due to WDDM batching/overhead. To overcome this delayed execution, I found that if I insert cudaEventRecord and then cudaEventQuery, the enqueued instructions start right away. This supports Robert's argument about WDDM. Any delay in execution wastes cycles.
Then I can use cudaEventQuery polling instead of a callback (sketched below).
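
A minimal sketch of what I mean, assuming the same stall_kern/stream setup as above (the poll loop is my own illustration, not code from the program above):

// Record an event after the kernel instead of enqueuing a host function...
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
stall_kern<<<1,1,0,stream0>>>(20000000);
cudaEventRecord(done, stream0);

// ...and poll it from the host; cudaEventQuery also seems to push
// pending work to the GPU under WDDM, so the kernel starts right away.
while (cudaEventQuery(done) == cudaErrorNotReady) {
  // do other host-side work, or yield
}
cudaEventDestroy(done);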

Thank you Robert.

I think I will file a bug report.

See the PS section in my reply to njuffa submitted shortly before this.

My concern with exploiting such observed behavior is that unless it is guaranteed (check the documentation), the solution becomes brittle: without a specification that says so, the behavior could change with any update of the CUDA driver or runtime.

So, is this simply an implementation artifact or officially defined behavior? I don't know the answer.

Thank you njuffa.

I agree, but what can I do if I want the code to run on most GPUs and on both Windows and Linux? cudaLaunchHostFunc is documented to work in a certain way, and as far as I can tell, it does not.

As Robert Crovella already said, you would want to report discrepancies between documentation and actual behavior to NVIDIA in the form of a bug report. Of course there are no guarantees in which way the discrepancy will be resolved: either the behavior will be changed to match the current description, or the description will be changed to match the current implementation. I have seen either happen in the past.

In practical terms I’d say: Deliver the best-effort solution for both Windows and Linux (if the Windows version runs a bit slower, c’est la vie), and add a note for the maintainer that the current solution for Windows may need some additional TLC in the future.

Way back when CUDA was still young, I wound up revising some of my code with every major new CUDA version because I had to rely on implementation artifacts for performance reasons. Half a dozen re-writes over as many years.


I think you summarized the situation well. Thank you.

The above results were obtained with the latest driver.

Driver version 460.89 seems to produce the correct behavior.

That would be good information to put into a bug report.

Done. Thank you.

I thought I would share the outcome with the group.

I filed a bug report and today I got this answer.

"After checking on our local . We think the behavior now is expected . We can see the tasks on other streams are blocked by cudaLaunchHostFunc only when Hardware-accelerated GPU scheduling is off .

A workaround for you is to switch to HW accelerated scheduling, which will remove the serialization: Settings > System > Graphics settings > Hardware-accelerated GPU scheduling On->Reboot

In the future windows releases , HW accelerated scheduling will be default on which will eliminate this issue in a larger extend. For now , we have other implementation in HW accelerated scheduling off mode that changed this behavior to be serializing . But we designed it as expected.
Hope this explains your issue . Please let us know if you still see the issue on HW accelerated scheduling mode . Thank you."

After testing, I can confirm the behavior described in the reply.

Using CUDA 11.3, and 466.11 Driver.
When Hardware-accelerated GPU scheduling is OFF, cudaLaunchHostFunc serializes (blocks work on all streams).
When Hardware-accelerated GPU scheduling is ON, cudaLaunchHostFunc does not block work on other streams (correct behavior).

Cheers.