I’m trying to use CUBLAS SGEMM in my Fortran program. The thunking code adds enormous overhead for data transfer. Non-thunking code with cublas_set_matrix etc. is better, but still not good. Surprisingly, thunking code with pinned memory transfer is the worst: it is slower than the plain thunking code.
I’m not sure I understand why pinned memory transfer is needed at all. Can it be used with non-thunking code? And, in principle, what is the best way to organize host-device data transfer in Fortran?