Deadlock on CUDA kernel launch

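This is an example of lazy loading. With lazy loading, a kernel is not loaded until it is first used, and that load can require a context synchronization. If a kernel that is already running is waiting on the one you are trying to launch, the load cannot complete and the launch hangs. Lazy loading is a documented mechanism, and you have some options for workarounds. A likely one is to follow the documented suggestion for emulating eager loading: call cudaFuncGetAttributes() on each kernel before its first launch. (Also, kernel-to-kernel communication that requires concurrency is a frowned-on design practice, because CUDA does not guarantee that two kernels will run concurrently.)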
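As a rough illustration of the cudaFuncGetAttributes() workaround, here is a minimal sketch. The kernels kernelA and kernelB and the helper preloadKernel are hypothetical stand-ins, not taken from your code. The only thing that matters is that each kernel is queried once before anything is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the ones in the question.
__global__ void kernelA(int *out) { out[0] = 1; }
__global__ void kernelB(int *out) { out[1] = 2; }

// Hypothetical helper: querying a kernel's attributes forces it to be
// loaded now instead of at its first launch.
cudaError_t preloadKernel(const void *kernel) {
    cudaFuncAttributes attr;
    return cudaFuncGetAttributes(&attr, kernel);
}

int main() {
    // Preload every kernel before the first launch.
    if (preloadKernel((const void *)kernelA) != cudaSuccess ||
        preloadKernel((const void *)kernelB) != cudaSuccess) {
        fprintf(stderr, "kernel preload failed\n");
        return 1;
    }

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) return 1;

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Neither launch has to trigger a load at this point.
    kernelA<<<1, 1, 0, s1>>>(d_out);
    kernelB<<<1, 1, 0, s2>>>(d_out);

    cudaError_t err = cudaDeviceSynchronize();

    int h_out[2] = {0, 0};
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out = %d %d (%s)\n", h_out[0], h_out[1], cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Another documented option, if you would rather not touch the code, is to set the environment variable CUDA_MODULE_LOADING=EAGER before starting the process. Either way, this only removes the load-time hang. It does not make concurrent execution of the two kernels guaranteed.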