You should be able to write GPU code (i.e. kernels) that runs at comparable speed in CUDA C, CUDA C++, or CUDA Fortran. CUDA Fortran is a translation system: it ultimately produces a CUDA C/C++ implementation of the GPU code that can be called from Fortran, but is compiled with the CUDA C/C++ compiler. Python performance will depend on the specific CUDA/Python implementation. PyCUDA is essentially a wrapper environment that allows C/C++ kernels or CUDA libraries to be called from Python. In that case, an equivalent kernel should have the same performance whether it is called from PyCUDA or from CUDA C/C++ (or from CUDA Fortran, if the kernel that CUDA Fortran generates is equivalent to how you would write the same function in CUDA C/C++).
The net of it is that all of the systems I describe ultimately use the CUDA C/C++ compiler and toolchain, at least for the CUDA kernel portions of the code, so you should not expect any inherent performance differences on that account.
For the portions of your code that run on the host CPU (even in a combined CPU/GPU implementation), there will be language differences, and I'm not addressing those here.