Question about implicit functions with device arrays in CUDA Fortran

Hi,
I’m working on a CUDA Fortran code developed and optimized for GPU clusters. The code uses CPUs for input/output operations and GPUs for the compute-intensive kernels.

I recently implemented some custom subroutines to handle matrix transposition and matrix-matrix multiplication, as I read that intrinsic functions like transpose, matmul, and norm2 are not efficient—or possibly not even usable—when working with device arrays.
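For context, my custom transpose is roughly along these lines (a simplified sketch, not my actual code; all names here are illustrative). It stages a tile through shared memory so that both the global load and the global store are coalesced:

```fortran
module transpose_mod
  use cudafor
  implicit none
  integer, parameter :: TILE = 32
contains
  ! Tiled out-of-place transpose: b = transpose(a) for an n x m array a.
  ! Launch with dim3 blocks covering the matrix and TILE x TILE threads.
  attributes(global) subroutine transpose_kernel(b, a, n, m)
    integer, value :: n, m
    real, intent(in),  device :: a(n, m)
    real, intent(out), device :: b(m, n)
    ! Extra column pads the tile to avoid shared-memory bank conflicts.
    real, shared :: tile(TILE, TILE + 1)
    integer :: i, j

    ! Coalesced read of one tile of a into shared memory.
    i = (blockIdx%x - 1) * TILE + threadIdx%x
    j = (blockIdx%y - 1) * TILE + threadIdx%y
    if (i <= n .and. j <= m) tile(threadIdx%x, threadIdx%y) = a(i, j)
    call syncthreads()

    ! Coalesced write of the transposed tile into b (block indices swapped).
    i = (blockIdx%y - 1) * TILE + threadIdx%x
    j = (blockIdx%x - 1) * TILE + threadIdx%y
    if (i <= m .and. j <= n) b(i, j) = tile(threadIdx%y, threadIdx%x)
  end subroutine transpose_kernel
end module transpose_mod
```

My matmul replacement follows the same pattern, so my question is really whether hand-written kernels like this are necessary, or whether the intrinsics handle device arrays just as well.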

I’d like to clarify whether it’s safe and efficient to use these intrinsic functions with device arrays, or if it’s better to always replace them with custom device-aware implementations.

Thank you very much for your help!
Federico

CUDA Fortran code is usually compiled with the nvfortran compiler. In my experience, questions about CUDA Fortran get better attention when posted on the nvfortran forum here. I can move your question there if you'd like.

Thank you for the quick reply. It’s my first time posting in this forum, and I’m still figuring out how it works. I’d be grateful if you could move the question to the correct section of the forum. Thanks again.
Federico

OK, I see you have already reposted. I'm just going to close this duplicate.