Hi,
I’m working on a CUDA Fortran code developed and optimized for GPU clusters. The code uses CPUs for input/output operations and GPUs for the compute-intensive kernels.
I recently implemented custom subroutines to handle matrix transposition and matrix-matrix multiplication, because I read that intrinsic functions such as transpose, matmul, and norm2 are inefficient, or possibly not usable at all, when applied to device arrays.
Could you clarify whether it is safe and efficient to call these intrinsics directly on device arrays, or whether they should always be replaced with custom device-aware implementations (or library calls such as cuBLAS)?
Thank you very much for your help!
Federico