I wouldn’t expect that; the very notion of “serialization” contradicts it. I know of no specification or documentation that provides details on this point.
I will offer up a thought experiment. Suppose, hypothetically, that there were some “guarantee” of concurrent execution of host funcs. That would require one thread per callback/host func. Given that the number of host funcs launched can be arbitrarily large, how could any system provide that guarantee? It doesn’t make sense.
Furthermore, the CUDA developers have warned that attempting to perform synchronization using cudaLaunchHostFunc is unwise and may lead to trouble. You can find this warning in several places. If you like, start with the documentation of cudaLaunchHostFunc, already linked.
Based on all the information presented here, I conclude that using cudaLaunchHostFunc for an activity that has external dependencies is both unwise and unintended by the CUDA designers. It may lead to trouble.
Given that CUDA provides other methods to declare dependencies between two streams, such as cudaStreamWaitEvent, CUDA graphs, and perhaps others, I would encourage people to consider those.
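To illustrate the cudaStreamWaitEvent approach, here is a minimal sketch of making work in one stream depend on work previously issued to another, with no host func involved. The kernel names (producer/consumer) are hypothetical placeholders; only the event/stream API calls are the point.

```cuda
#include <cuda_runtime.h>

// Hypothetical trivial kernels, for illustration only
__global__ void producer(int *d) { *d = 42; }
__global__ void consumer(int *d) { *d += 1; }

int main() {
  cudaStream_t sA, sB;
  cudaEvent_t ev;
  int *d;
  cudaStreamCreate(&sA);
  cudaStreamCreate(&sB);
  cudaEventCreate(&ev);
  cudaMalloc(&d, sizeof(int));

  producer<<<1, 1, 0, sA>>>(d);     // work issued to stream A
  cudaEventRecord(ev, sA);          // mark a point of progress in stream A
  cudaStreamWaitEvent(sB, ev, 0);   // stream B will not proceed past here
                                    // until that point in stream A is reached
  consumer<<<1, 1, 0, sB>>>(d);     // ordered after producer, across streams

  cudaDeviceSynchronize();
  cudaFree(d);
  cudaEventDestroy(ev);
  cudaStreamDestroy(sA);
  cudaStreamDestroy(sB);
  return 0;
}
```

The dependency here is declared to the CUDA runtime, so the scheduler handles the ordering; there is no reliance on host-func execution behavior.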
I won’t be able to answer further questions about undocumented characteristics of the handling of cudaLaunchHostFunc. I also don’t wish to argue the thought experiment. It’s OK if you disagree with any of my points. We don’t have to agree. The behavior is what it is, regardless of my opinion.
Anyone interested in seeing a change to either CUDA behavior or documentation is encouraged to file a bug.