Thread safety of cuLaunchKernel

Hi CUDA team,

I’m curious whether the CUDA Driver API function cuLaunchKernel is thread safe, specifically when called on the same CUfunction concurrently from multiple (host) threads (in the same CUDA context).

CUfunction is an alias for a pointer to an opaque type (typedef struct CUfunc_st *CUfunction), and cuLaunchKernel takes that pointer to non-const as its first argument, which is not a good sign for thread safety. Indeed, the documentation for CUDA 10.1 says: “Calling cuLaunchKernel() sets persistent function state that is the same as function state set through the following deprecated APIs: cuFuncSetBlockShape(), cuFuncSetSharedSize(), cuParamSetSize(), cuParamSeti(), cuParamSetf(), cuParamSetv(). When the kernel f is launched via cuLaunchKernel(), the previous block shape, shared size and parameter info associated with f is overwritten.” So we’re told that calls to cuLaunchKernel modify the CUfunction. Without internal synchronization, concurrent calls would then be a data race and therefore not thread safe. That documentation does not mention thread safety.

However, an old doc from CUDA Toolkit 4 (page 7) says:

“For thread-safety, host threads launching kernels in the same context
concurrently must use the new thread-safe stateless launch API function
cuLaunchKernel(), which takes the place of the more verbose earlier API (i.e.,
cuParamSet*() + cuFuncSetBlockShape() + cuFuncSetSharedSize() + cuLaunchGrid()).
Note that with this new API, kernel grid launches in the CUDA Driver API more
closely resemble kernel launches via the <<<>>> syntax of the CUDA Runtime API.”

This seems to contradict the other doc by saying cuLaunchKernel is stateless (though it is for an older version of CUDA).

So what’s the deal here? Is cuLaunchKernel stateless or stateful? Is it safe to call concurrently from multiple host threads on the same CUfunction (in the same CUDA context)? If it is thread safe and stateful, how does it achieve this?

Thomas Peters

cuLaunchKernel should be thread safe, whether the concurrent calls use the same or different CUfunction, and whether they are in the same or different contexts. The description in the API manual is intended to convey the idea that if you had previously set any of the launch configuration parameters, they would be overwritten on that call.
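
To make the "overwritten on that call" reading concrete, here is a minimal host-only sketch. It does not call the CUDA driver at all: MockFunction, mock_set_block_shape, mock_launch_grid and mock_launch_kernel are hypothetical stand-ins invented for illustration, and the sketch only models the shape of the two APIs (separate configure and launch calls versus a single call carrying the whole configuration). It is an assumption about how the two documents can be reconciled, not a description of the driver's internals.

```cpp
#include <cassert>

// Hypothetical stand-in for a kernel handle; NOT the real CUfunc_st.
struct MockFunction {
    unsigned block_x = 0;                // persistent launch state kept on the handle
    unsigned last_launched_block_x = 0;  // what the most recent launch actually used
};

// Old-style shape: configuration and launch are separate calls,
// so the launch depends on whatever state was set last.
void mock_set_block_shape(MockFunction& f, unsigned block_x) { f.block_x = block_x; }
void mock_launch_grid(MockFunction& f) { f.last_launched_block_x = f.block_x; }

// New-style shape: one call carries the whole configuration.
// Persistent state is still overwritten, but the launch only uses its own arguments.
unsigned mock_launch_kernel(MockFunction& f, unsigned block_x) {
    assert(block_x > 0);
    f.block_x = block_x;
    f.last_launched_block_x = block_x;
    return block_x;
}

// Replays, on a single thread, an interleaving that two host threads
// could produce with the multi-call API.
unsigned interleaved_stateful_launch() {
    MockFunction f;
    mock_set_block_shape(f, 128);  // "thread A" configures
    mock_set_block_shape(f, 256);  // "thread B" configures before A launches
    mock_launch_grid(f);           // "thread A" launches with B's shape
    return f.last_launched_block_x;
}
```

In this model interleaved_stateful_launch() ends up launching with 256 even though "thread A" asked for 128, while mock_launch_kernel(f, 128) always launches with 128 no matter what was set before. That is one way a call can be "stateless" from the caller's point of view and still overwrite persistent state.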

I don’t have details on the implementation inside the driver API library, but if there were synchronization needed (amongst threads), the driver API library is certainly able to manage that.
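
As a rough illustration of how a library could keep a state-overwriting call thread safe, here is a host-only sketch that guards the persistent state with an internal lock. Everything in it (SyncedFunction, LaunchConfig, synced_launch, launch_from_threads) is hypothetical and made up for this example; whether the real driver uses a mutex, a per-launch copy, or something else entirely is not documented, so treat this purely as an assumption.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Launch configuration passed by value on every call.
struct LaunchConfig {
    unsigned block_x;
    unsigned shared_bytes;
};

// Hypothetical handle whose persistent state is guarded by an internal lock.
// Illustration only; this is not how CUfunc_st is known to be implemented.
struct SyncedFunction {
    std::mutex lock;
    LaunchConfig persistent{0, 0};
    std::vector<LaunchConfig> submitted;  // stands in for work queued to the device
};

// Overwrites the persistent state and records the launch while holding the lock,
// so no other thread can change the state between the two steps.
void synced_launch(SyncedFunction& f, LaunchConfig cfg) {
    assert(cfg.block_x > 0);
    std::lock_guard<std::mutex> guard(f.lock);
    f.persistent = cfg;
    f.submitted.push_back(f.persistent);
}

// Launches from several host threads on the same handle, each with its own configuration.
std::vector<LaunchConfig> launch_from_threads(unsigned num_threads,
                                              unsigned launches_per_thread) {
    SyncedFunction f;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&f, t, launches_per_thread] {
            // shared_bytes is always 32 * block_x, so a mixed-up pair would be detectable.
            LaunchConfig cfg{32u * (t + 1), 1024u * (t + 1)};
            for (unsigned i = 0; i < launches_per_thread; ++i) {
                synced_launch(f, cfg);
            }
        });
    }
    for (auto& w : workers) w.join();
    return f.submitted;
}
```

Calling launch_from_threads(4, 100) and checking that every recorded entry still has shared_bytes equal to 32 * block_x is a quick way to see that no launch picked up half of another thread's configuration.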

If you’re looking for documentation that supports this, my suggestion would be to file a bug using the instructions linked in a sticky post at the top of this forum.

The base CUDA APIs (driver and runtime) were made thread-safe some time ago. Specific contrary observations should be reported. The one specific exception that I’m aware of at this time is the CUDA graph APIs, and their lack of thread safety is called out in the documentation. (I’m also ignoring deprecated functionality.)

Thank you for your reply, Robert. Cheers