I’m using a persistent kernel to continuously process network packets on the GPU, and I would like to run a cuDNN convolution kernel (cudnnConvolutionForward) concurrently with this persistent kernel.
When I try to do so, the application deadlocks. I was told that cuDNN uses a device-wide synchronization and that it cannot launch enough blocks while my persistent kernel is occupying the GPU, which causes the deadlock.
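To spell out the claim as I understand it, here is a toy model in C++. It is only a sketch of the reasoning and says nothing about how cuDNN actually works internally. The function names and the numbers (80 SMs, one resident block per SM) are hypothetical, and real residency also depends on registers, shared memory, and threads per block.

```cpp
// Toy model only: it counts "block slots" and ignores registers,
// shared memory, and thread limits. Nothing here reflects cuDNN internals.

// Block slots still free once the persistent kernel's blocks are resident.
constexpr int free_block_slots(int sm_count, int max_blocks_per_sm,
                               int persistent_blocks) {
    return persistent_blocks >= sm_count * max_blocks_per_sm
               ? 0
               : sm_count * max_blocks_per_sm - persistent_blocks;
}

// A kernel that contains a grid-wide barrier can only get past that barrier
// if all of its blocks are resident on the device at the same time.
constexpr bool grid_barrier_can_complete(int free_slots, int blocks_needed) {
    return blocks_needed <= free_slots;
}

// Hypothetical device: 80 SMs, 1 resident block per SM.
// Persistent kernel on 40 SMs: a 40-block barrier kernel still fits.
static_assert(grid_barrier_can_complete(free_block_slots(80, 1, 40), 40),
              "half the device is free, so 40 blocks can be co-resident");
// Persistent kernel on all 80 SMs: even a 1-block kernel never starts.
static_assert(!grid_barrier_can_complete(free_block_slots(80, 1, 80), 1),
              "no free slots, so the second kernel waits forever");
```

If this model is right, the second kernel waits for blocks that can never become resident because the persistent kernel never exits, which would match the hang I see.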
Is it true that cuDNN uses a device-wide synchronization? If so, is it possible to set the number of SMs on which the cuDNN kernel will be launched?
Thanks for any information,