cudaLaunchHostFunc API example

I couldn't find any cudaLaunchHostFunc example on Google.

Does anybody know how cudaLaunchHostFunc works?

__host__ cudaError_t cudaLaunchHostFunc ( cudaStream_t stream, cudaHostFn_t fn, void* userData )

In that API, I don’t understand ‘cudaHostFn_t fn’ and ‘void* userData’.

fn is the name of a host function, but where do its arguments go? Is userData the argument area?

The function is described here:

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g05841eaa5f90f27124241baafb3e856f[/url]

It is effectively what underpins the CUDA stream callback functionality. Therefore, rather than trying to use this launch mechanism directly, I would recommend that you use stream callbacks as defined in the programming guide:

[url]https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-callbacks[/url]

API manual:

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g74aa9f4b1c2f12d994bf13876a5a2498[/url]

and there is also a CUDA sample that gives an example:

[url]https://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cuda-callbacks[/url]
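A minimal sketch of the pattern (the callback body and names here are just illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Signature required by cudaStreamAddCallback. The callback runs on a CPU
// thread once all work previously enqueued on the stream has completed.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    int *value = static_cast<int *>(userData);  // recover whatever was passed in
    printf("stream work done, status = %d, value = %d\n", status, *value);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int hostValue = 42;

    // ... enqueue kernels / memcpys on `stream` here ...

    cudaStreamAddCallback(stream, myCallback, &hostValue, 0);  // flags must be 0

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```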

Thank you so much Robert.

I will check the API!

Hello, Mr. Robert.

I read the API documentation at your link.

But I am confused by the parameters.

__host__ cudaError_t cudaStreamAddCallback ( cudaStream_t stream, cudaStreamCallback_t callback, void* userData, unsigned int flags )

In that API, what datatypes are ‘cudaStreamCallback_t callback’ and ‘void* userData’?

That is, is ‘cudaStreamCallback_t callback’ a host function?
Is ‘void* userData’ the host function’s parameter?

If ‘void* userData’ is the host function’s parameter, how do I pass the parameters?

For example, myCallback(int a, float b, …)

Thank you so much Mr. Robert!

Why not look at the sample code I pointed you to?

Sorry… I forgot to look at the sample code.

But something is still difficult for me.

There is no example of passing multiple parameters.

Thank you so much, Mr. Robert.

Pass a pointer to a struct:

[url]https://devtalk.nvidia.com/default/topic/1044166/cuda-programming-and-performance/running-streams-parallel-with-the-host-functions/[/url]
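Something along these lines (a minimal sketch; the struct and its fields are just whatever your callback needs):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Pack however many "parameters" the host function needs into one struct.
struct CallbackArgs {
    int   a;
    float b;
};

// Signature required by cudaLaunchHostFunc (cudaHostFn_t).
void CUDART_CB myHostFn(void *userData)
{
    CallbackArgs *args = static_cast<CallbackArgs *>(userData);
    printf("a = %d, b = %f\n", args->a, args->b);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The object must still be alive when the host function runs,
    // i.e. don't let it go out of scope before the stream has finished.
    static CallbackArgs args{5, 3.14f};

    // ... enqueue GPU work on `stream` ...

    cudaLaunchHostFunc(stream, myHostFn, &args);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```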

Oh! userData can be a pointer to a structure! Thank you!

Just a question about cudaLaunchHostFunc that your links didn’t answer. It’s quite clear that the CUDA runtime uses its own CPU thread to execute the host function. But, using NVTX to tag the inside of the host function, I can see that the (CPU) executions are serialized even when they are enqueued on different CUDA streams:

  • Are host functions enqueued on different CUDA streams asynchronous with respect to each other?
  • Does the CUDA runtime have a CPU thread pool to do so? Or does it use something like std::async?
  • If it does have its own thread pool, is this pool always initialized? Or only when you call cudaStreamAddCallback or cudaLaunchHostFunc?
  • How many threads does the thread pool spawn, and what does that depend on?

Thanks!

Same question here. My experiments show that callbacks from multiple streams are serialized, while one would think that host function launches from multiple streams would execute asynchronously.

That is not even true for compute kernels. Two operations from different streams may overlap, if there are enough resources. If a kernel fully occupies the GPU, no other kernel can run at the same time on the same device.

It is the same for host callbacks. If the available resources (one callback thread per device) are fully utilized, i.e. the callback thread is currently processing a callback, no other callback can be processed.

In the case of the CPU, you may want to have a CPU thread per stream in order to execute callbacks. That would lead to faster execution when you have multicore CPUs.

The synchronization can be handled from a stream perspective: the synchronization enforced by the streams and events tells the runtime when a callback can be safely executed, even if you have a CPU thread per stream.

So I was surprised that there is only one thread per device when executing callbacks. That severely limits the usefulness of callbacks in our use case.

The function documentation states that function handling may be serialized:

Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.

Beyond that, I’m not aware of any detailed description or specification for the thread or threads that may be spun up by the CUDA runtime for host func handling.

Host functions are enqueued on the stream indicated in the function call, and stream semantics apply.

Hi!

I think I did not explain myself correctly; let me rephrase.

For a given set of callbacks, each one enqueued in a different stream by the programmer, should we expect those callbacks to be asynchronous with respect to each other and to execute in parallel?

As per the documentation shared, I understand that the callbacks “may be serialized”, so there is no guarantee of parallelism.

According to another answer in this thread, there is a single CPU thread in the CUDA runtime responsible for executing all the callbacks from all the streams. Looking at the documentation, I’m guessing that this may or may not be the case, depending on the runtime version. Could you confirm this, @Robert_Crovella?

Thanks!

I wouldn’t expect that. The idea of “serialization” contradicts that idea.

There is no specification that I know of, nor any documentation, that provides details.

I will offer up a thought experiment. Suppose, hypothetically, that there was some “guarantee” provided of concurrent execution of host funcs. That would require 1 thread per callback/host func. Given that the number of host funcs launched can be arbitrarily large, how could any system provide that guarantee? It doesn’t make sense.

Furthermore, the CUDA developers have warned that attempting to do synchronization using cudaLaunchHostFunc is unwise and may lead to trouble. You can find this warning in several places. If you like, start with the documentation of cudaLaunchHostFunc already linked.

Based on all the information presented here, I conclude that attempting to use cudaLaunchHostFunc for an activity that has external dependencies is both unwise, and unintended by the CUDA designers. It may lead to trouble.

Given that CUDA provides other methods to declare dependencies between two streams, such as cudaStreamWaitEvent and CUDA graphs, I would encourage people to consider those.
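For example, a dependency of one stream on work in another can be expressed with an event (a minimal sketch):

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaStream_t streamA, streamB;
    cudaEvent_t  done;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // ... enqueue work on streamA ...
    cudaEventRecord(done, streamA);

    // Everything enqueued on streamB after this point waits until the work
    // recorded in `done` on streamA has completed.
    cudaStreamWaitEvent(streamB, done, 0);
    // ... enqueue dependent work on streamB ...

    cudaStreamSynchronize(streamB);
    cudaEventDestroy(done);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    return 0;
}
```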

I won’t be able to answer further questions about undocumented characteristics of the handling of cudaLaunchHostFunc. I also don’t wish to argue the thought experiment. It’s OK if you disagree with any of my points. We don’t have to agree. The behavior is what it is, regardless of my opinion.

Anyone interested in seeing a change to either CUDA behavior or documentation is encouraged to file a bug.

Thanks @Robert_Crovella

No, we don’t use cudaLaunchHostFunc for synchronization, nor for CPU work that has any external dependency.

We use it only for doing CPU tasks on a different CPU thread than the one enqueueing work on the stream, so that both can work in parallel and make the software faster.

Regarding the number of CPU threads, you could always use a thread pool with the number of threads limited to the number of CPU cores in the system.

No problem, we’ll keep creating our own threads to make the execution faster. It would have been more convenient for us to be able to use host functions instead. We will consider filing a bug.

Thanks again!

Hello Mr Robert,

In our small example, we have two streams that each enqueue a host function that is a one-second sleep. Because each stream waits for that host function to complete before continuing, the first stream ends up waiting one second and the second stream two seconds. Given that we run this on a large server, I don’t understand why the host functions are not executed in parallel.
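Stripped down, the setup looks roughly like this (names and the timing code are just illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

void CUDART_CB sleepOneSecond(void * /*userData*/)
{
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    auto t0 = std::chrono::steady_clock::now();
    cudaLaunchHostFunc(s1, sleepOneSecond, nullptr);
    cudaLaunchHostFunc(s2, sleepOneSecond, nullptr);
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    auto t1 = std::chrono::steady_clock::now();

    // Observed: about 2 s total, i.e. the two host functions run one after
    // the other even though they are enqueued on different streams.
    printf("elapsed: %.1f s\n", std::chrono::duration<double>(t1 - t0).count());

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```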

The example in cuda-samples is good, but it manually launches a new thread for the work; this has the undesirable effect of not blocking the stream, although it does allow us to run the host code in parallel.

The solution we might use is a CPU-side thread sync before re-scheduling commands to the CUDA API, but it feels more like a workaround. Do you have any comments on this, or on how we could diagnose why the host functions are not executing in parallel?

If you want to imagine or understand how it could be possible that two separate host function launches are not executed in parallel, consider the possibility that only a single CPU thread is used by the CUDA runtime to run any and all host functions.

To me that possibility is self-evident from the “may be serialized” statement in the CUDA documentation, which was already discussed in this thread.

I don’t have any further explanation of the underlying mechanism, which is not specified in detail by CUDA.