Persistent kernel + cuDNN


I’m using a persistent kernel to perform a network packet processing task continuously on the GPU, and I would like to run a cuDNN convolution kernel (cudnnConvolutionForward) concurrently with this persistent kernel.

When I try to do so, the application deadlocks. I was told that cuDNN uses a device-wide synchronization and cannot launch enough blocks because my persistent kernel is occupying the GPU, which causes the deadlock.

Is it true that cuDNN uses a device-wide synchronization? If so, is it possible to set the number of SMs on which the cuDNN kernel will be launched?

Thanks for any information,


Sorry for the delayed response.
Assuming the user is calling cudnnConvolutionForward(), there are no device-wide synchronizations in it, as best we can tell. We do call cudaMalloc inside cudnnCreate(), so the user should make sure that none of the Implicit Synchronization cases apply.
Please refer to,

Even though I don’t think streams are an issue here, if you’re using an independent persistent kernel, it is probably best to have it on its own CUDA stream.
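
To make the two suggestions above concrete, here is a minimal sketch, not a verified fix for the deadlock. It assumes that calling cudnnCreate() and doing every allocation before the persistent kernel is launched is enough to avoid the implicit synchronization cases. The kernel `persistentKernel`, the stop flag, and the choice of using half the SMs for the grid are all hypothetical placeholders. Note that a smaller grid only leaves resources free; it does not pin blocks to particular SMs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cudnn.h>

// Hypothetical persistent kernel: spins until the host sets *stop to 1.
__global__ void persistentKernel(volatile int *stop) {
    while (*stop == 0) {
        // ... poll for packets and process them (placeholder) ...
    }
}

int main() {
    // 1. Do everything that may synchronize implicitly BEFORE the
    //    persistent kernel starts: cudnnCreate() (which calls cudaMalloc),
    //    plus any device or pinned host allocations.
    cudnnHandle_t handle;
    if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) {
        fprintf(stderr, "cudnnCreate failed\n");
        return 1;
    }

    // Stop flag in mapped pinned host memory, visible to the running kernel.
    volatile int *hStop = nullptr;
    int *dStop = nullptr;
    cudaHostAlloc((void **)&hStop, sizeof(int), cudaHostAllocMapped);
    *hStop = 0;
    cudaHostGetDevicePointer((void **)&dStop, (void *)hStop, 0);

    // 2. Give each workload its own non-blocking stream so that neither
    //    one synchronizes with the legacy default (NULL) stream.
    cudaStream_t persistStream, dnnStream;
    cudaStreamCreateWithFlags(&persistStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&dnnStream, cudaStreamNonBlocking);
    cudnnSetStream(handle, dnnStream);  // cuDNN kernels go to dnnStream

    // 3. Assumption: launch fewer persistent blocks than there are SMs
    //    so that other kernels still have room to become resident.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount > 1 ? prop.multiProcessorCount / 2 : 1;

    persistentKernel<<<blocks, 128, 0, persistStream>>>(dStop);

    // 4. cudnnConvolutionForward(handle, ...) calls would go here. Their
    //    descriptors, tensors and workspace should already have been
    //    allocated in step 1, not while the persistent kernel is running.

    // 5. Shut down: signal the kernel, wait for it, then release resources.
    *hStop = 1;
    cudaStreamSynchronize(persistStream);

    cudnnDestroy(handle);
    cudaStreamDestroy(persistStream);
    cudaStreamDestroy(dnnStream);
    cudaFreeHost((void *)hStop);
    return 0;
}
```

If the deadlock still occurs with this ordering, checking whether any call made while the persistent kernel is running (an allocation, a memset, or work on the NULL stream) falls under the Implicit Synchronization cases would be the next thing to look at.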

Thank you.