In the same way that your small prototype worked as desired (i.e. kernels could run in parallel), the cuDNN- and cuBLAS-based algorithms can run in parallel. Both processes do all cublasCreate or cudnnCreate calls first (just as both of your prototype processes did cudaMalloc first); then both processes can run whatever work they wish to do with those handles already created. If you are constantly creating/destroying handles, that is probably a bad pattern, and you should refactor it, just as in ordinary CUDA code you would reuse device memory allocations rather than constantly calling cudaMalloc/cudaFree.
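As a rough sketch of that pattern (cuBLAS shown; the matrix size and iteration count are placeholders I've chosen for illustration), each process would look something like this:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 256;                  // placeholder matrix dimension
    float *A, *B, *C;
    // Allocate device memory once, up front
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    cudaMemset(A, 0, n * n * sizeof(float));
    cudaMemset(B, 0, n * n * sizeof(float));

    // Create the handle once, up front (analogous to doing cudaMalloc first)
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Reuse the same handle and the same allocations for all subsequent work
    for (int i = 0; i < 100; ++i) {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    }
    cudaDeviceSynchronize();

    // Destroy once, at the end -- not inside the work loop
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```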
Furthermore, witnessing kernel concurrency generally requires “small” kernels that execute for some reasonable/visible duration and have low resource utilization, so that two of them can actually run concurrently. The same ideas hold for cuDNN/cuBLAS: if you issue large enough work, you will not witness concurrency between kernels issued by cuDNN or cuBLAS in two separate processes, because the GPU does not have enough compute resources to run both concurrently.
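For reference, here is the kind of kernel that does demonstrate concurrency. This sketch uses two streams in a single process rather than two processes, purely to illustrate the “visible duration, low resource utilization” requirement; the spin duration is an arbitrary value I've chosen:

```cpp
#include <cuda_runtime.h>

// A deliberately "small" kernel: one block, low resource use,
// but a visible duration (spins for roughly `cycles` clock cycles)
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Each launch occupies a single small block, so the GPU has
    // plenty of spare resources to run both kernels at the same time
    spin<<<1, 64, 0, s1>>>(1000000000LL);
    spin<<<1, 64, 0, s2>>>(1000000000LL);
    cudaDeviceSynchronize();  // inspect the overlap in a profiler such as nsys
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```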
For cuBLAS we can quickly estimate the “size” of work that will saturate a GPU, preventing any ability to witness concurrency: take whatever matrix or array size you are using, convert it into threads, and compare that to the thread-carrying capacity of the GPU you are running on. If the number of threads required is equal to or greater than the GPU's thread-carrying capacity, it is unrealistic to expect to witness concurrency, whether with ordinary CUDA kernels or with cuBLAS or cuDNN calls.
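Here is a sketch of that arithmetic. The one-thread-per-output-element assumption is a rough approximation of typical gemm kernel behavior, not an exact cuBLAS figure, and the matrix dimension is a placeholder:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int dev = 0, numSMs = 0, maxThreadsPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&maxThreadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, dev);
    // Thread-carrying capacity: SM count times max resident threads per SM
    long long capacity = (long long)numSMs * maxThreadsPerSM;

    // Rough assumption: a gemm kernel uses on the order of one thread
    // per output element, so an n x n gemm needs ~n*n threads
    int n = 1024;                       // placeholder matrix dimension
    long long threadsNeeded = (long long)n * n;

    printf("thread-carrying capacity: %lld\n", capacity);
    printf("threads needed (~n*n):    %lld\n", threadsNeeded);
    if (threadsNeeded >= capacity)
        printf("GPU saturated: don't expect to witness concurrency\n");
    else
        printf("GPU not saturated: concurrency is at least possible\n");
    return 0;
}
```

For example, a GPU with 108 SMs and 2048 resident threads per SM carries 221,184 threads, so even a single 512x512 gemm (~262,144 threads, under the rough assumption above) is already enough to fill it.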