cudaLaunchHostFunc API example

I couldn't find any cudaLaunchHostFunc example on Google.

Does anybody know how cudaLaunchHostFunc works?

__host__ cudaError_t cudaLaunchHostFunc ( cudaStream_t stream, cudaHostFn_t fn, void* userData )

In that API, I don’t understand ‘cudaHostFn_t fn’ and ‘void* userData’.

fn is the name of a host function, but where do its arguments go? Is userData the argument area?

The function is described here:

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g05841eaa5f90f27124241baafb3e856f[/url]

It is effectively what underpins the CUDA stream callback functionality. Therefore, rather than trying to use this launch mechanism directly, I would recommend that you use stream callbacks as defined in the programming guide:

[url]https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-callbacks[/url]

API manual:

[url]https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g74aa9f4b1c2f12d994bf13876a5a2498[/url]

and there is also a CUDA sample that gives an example:

[url]https://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cuda-callbacks[/url]
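A minimal sketch of the pattern (the callback body and names here are just illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Signature required by cudaStreamAddCallback. The callback runs on a CPU
// thread once all work previously enqueued on the stream has completed.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    int *value = static_cast<int *>(userData);  // recover whatever was passed in
    printf("stream work done, status = %d, value = %d\n", status, *value);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int hostValue = 42;

    // ... enqueue kernels / memcpys on `stream` here ...

    cudaStreamAddCallback(stream, myCallback, &hostValue, 0);  // flags must be 0

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```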

Thank you so much Robert.

I will check the API!

Hello, Mr. Robert.

I read the API documentation at your link.

But I am confused by the parameters.

__host__ cudaError_t cudaStreamAddCallback ( cudaStream_t stream, cudaStreamCallback_t callback, void* userData, unsigned int flags )

In that API, what datatypes are ‘cudaStreamCallback_t callback’ and ‘void* userData’?

That is, is ‘cudaStreamCallback_t callback’ a host function?
Is ‘void* userData’ the host function’s parameter?

If ‘void* userData’ is the host function’s parameter, how do I pass the parameters?

For example, myCallback(int a, float b, …)

Thank you so much Mr. Robert!

Why not look at the sample code I pointed you to?

Sorry… I forgot to look at the sample code.

But something is still difficult for me.

There is no example of passing multiple parameters.

Thank you so much, Mr. Robert.

Pass a pointer to a struct:

[url]https://devtalk.nvidia.com/default/topic/1044166/cuda-programming-and-performance/running-streams-parallel-with-the-host-functions/[/url]
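Something along these lines (a minimal sketch; the struct and its fields are just whatever your callback needs):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Pack however many "parameters" the host function needs into one struct.
struct CallbackArgs {
    int   a;
    float b;
};

// Signature required by cudaLaunchHostFunc (cudaHostFn_t).
void CUDART_CB myHostFn(void *userData)
{
    CallbackArgs *args = static_cast<CallbackArgs *>(userData);
    printf("a = %d, b = %f\n", args->a, args->b);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The object must still be alive when the host function runs,
    // i.e. don't let it go out of scope before the stream has finished.
    static CallbackArgs args{5, 3.14f};

    // ... enqueue GPU work on `stream` ...

    cudaLaunchHostFunc(stream, myHostFn, &args);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```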

Oh! userData can be a pointer to a structure! Thank you!

Just a question about cudaLaunchHostFunc that your links didn’t answer. It’s quite clear that the CUDA runtime uses its own CPU thread to execute the host function. But, using NVTX to tag the inside of the host function, I can see that the (CPU) executions are serialized even when they are enqueued on different CUDA streams:

  • Are host functions enqueued on different CUDA streams asynchronous with respect to each other?
  • Does the CUDA runtime have a CPU thread pool to do so? Or does it use something like std::async?
  • If it does have its own thread pool, is this pool always initialized? Or only when you call cudaStreamAddCallback or cudaLaunchHostFunc?
  • How many threads does the thread pool spawn, and what does that depend on?

Thanks!

Same question here. My experiments show that callbacks from multiple streams are serialized, while one would think that host function launches from multiple streams would execute asynchronously.

That is not even true for compute kernels. Two operations from different streams may overlap, if there are enough resources. If a kernel fully occupies the GPU, no other kernel can run at the same time on the same device.

It is the same for host callbacks. If the available resources (one callback thread per device) are fully utilized, i.e. the callback thread is currently processing a callback, no other callback can be processed.

In the case of the CPU, you may want to have a CPU thread per stream in order to execute callbacks. That would lead to faster execution when you have multicore CPUs.

The synchronization can be handled from a stream perspective: the synchronization enforced by the streams and events tells the runtime when a callback can be safely executed, even if you have a CPU thread per stream.

So I was surprised that there is only one thread per device when executing callbacks. That severely limits the usefulness of callbacks in our use case.

The function documentation states that function handling may be serialized:

Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.

Beyond that, I’m not aware of any detailed description or specification for the thread or threads that may be spun up by the CUDA runtime for host func handling.

Host functions are enqueued on the stream indicated in the function call, and stream semantics apply.

Hi!

I think I did not explain myself correctly; let me rephrase.

For a given set of callbacks, each one enqueued in a different stream by the programmer, should we expect those callbacks to be asynchronous with respect to each other and to execute in parallel?

As per the documentation shared, I understand that the callbacks “may be serialized”, so there is no guarantee of parallelism.

According to another answer in this thread, there is a single CPU thread in the CUDA runtime responsible for executing all the callbacks from all the streams. Looking at the documentation, I’m guessing that this may or may not be the case, depending on the runtime version. Could you confirm this, @Robert_Crovella?

Thanks!

I wouldn’t expect that. The idea of “serialization” contradicts that idea.

There is no specification that I know of, nor any documentation, that provides details.

I will offer up a thought experiment. Suppose, hypothetically, that there was some “guarantee” provided of concurrent execution of host funcs. That would require 1 thread per callback/host func. Given that the number of host funcs launched can be arbitrarily large, how could any system provide that guarantee? It doesn’t make sense.

Furthermore, the CUDA developers have warned that attempting to do synchronization using cudaLaunchHostFunc is unwise and may lead to trouble. You can find this warning in several places. If you like, start with the documentation of cudaLaunchHostFunc already linked.

Based on all the information presented here, I conclude that attempting to use cudaLaunchHostFunc for an activity that has external dependencies is both unwise, and unintended by the CUDA designers. It may lead to trouble.

Given that CUDA provides other methods to declare dependencies between two streams, such as cudaStreamWaitEvent and CUDA graphs, I would encourage people to consider those.
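For example, a dependency of one stream on work in another can be expressed with an event (a minimal sketch):

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaStream_t streamA, streamB;
    cudaEvent_t  done;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // ... enqueue work on streamA ...
    cudaEventRecord(done, streamA);

    // Everything enqueued on streamB after this point waits until the work
    // recorded in `done` on streamA has completed.
    cudaStreamWaitEvent(streamB, done, 0);
    // ... enqueue dependent work on streamB ...

    cudaStreamSynchronize(streamB);
    cudaEventDestroy(done);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    return 0;
}
```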

I won’t be able to answer further questions about undocumented characteristics of the handling of cudaLaunchHostFunc. I also don’t wish to argue the thought experiment. It’s OK if you disagree with any of my points. We don’t have to agree. The behavior is what it is, regardless of my opinion.

Anyone interested in seeing a change to either CUDA behavior or documentation is encouraged to file a bug.

Thanks @Robert_Crovella

No, we don’t use cudaLaunchHostFunc for synchronization, nor for CPU work that has any external dependency.

We use it only for doing CPU tasks on a different CPU thread than the one enqueueing work on the stream, so that both can work in parallel and make the software faster.

Regarding the number of CPU threads, you could always use a thread pool with the number of threads limited to the number of CPU cores in the system.

No problem, we’ll keep creating our own threads to make the execution faster. It would have been more convenient for us to be able to use host functions instead. We will consider filing a bug.

Thanks again!

Hello Mr Robert,

In our small example, we have two streams that each enqueue a host function that is a one-second sleep. Because each stream waits for that host function to complete before continuing, the first stream ends up waiting one second and the second stream two seconds. Given that we run this on a large server, I don’t understand why the host functions are not executed in parallel.
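Stripped down, the setup looks roughly like this (names and the timing code are just illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

void CUDART_CB sleepOneSecond(void * /*userData*/)
{
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    auto t0 = std::chrono::steady_clock::now();
    cudaLaunchHostFunc(s1, sleepOneSecond, nullptr);
    cudaLaunchHostFunc(s2, sleepOneSecond, nullptr);
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    auto t1 = std::chrono::steady_clock::now();

    // Observed: about 2 s total, i.e. the two host functions run one after
    // the other even though they are enqueued on different streams.
    printf("elapsed: %.1f s\n", std::chrono::duration<double>(t1 - t0).count());

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```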

The example in cuda-samples is good, but it manually launches a new thread for the work; this has the undesirable effect of not blocking the stream, although it does allow us to run the host code in parallel.

The solution we might use is a CPU-side thread sync before re-scheduling commands to the CUDA API, but it feels more like a workaround. Do you have any comments on this, or on how we could diagnose why the host functions are not executing in parallel?

If you want to imagine or understand how it could be possible that two separate host function launches are not executed in parallel, consider the possibility that only a single CPU thread is used by the CUDA runtime to run any and all host functions.

To me that possibility is self-evident from the “may be serialized” statement in the CUDA documentation, which was already discussed in this thread.

I don’t have any further explanation of the underlying mechanism, which is not specified in detail by CUDA.