Faster CUBLAS in Fortran: what is the best way to deal with data transfer times?

I’m trying to use CUBLAS SGEMM in my Fortran program. The thunking wrappers add enormous data-transfer overhead. The non-thunking wrappers (cublas_set_matrix etc.) are better, but still not good. Surprisingly, the thunking code with pinned memory transfer is the worst: it is even slower than the plain thunking code.
I’m not sure I understand why pinned memory transfer is needed at all. Can it be used with the non-thunking code somehow? And, in general, what is the best way to organize host-device data transfer in Fortran?
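
For reference, the non-thunking pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual program: it assumes the legacy `fortran.c` wrappers from the CUDA toolkit are compiled without `CUBLAS_USE_THUNKING` and linked in, that the host is 64-bit (so device pointers fit in `integer(8)`), and the matrix size `n` and all variable names are made up for the example.

```fortran
program sgemm_nonthunking
  implicit none
  integer, parameter :: n = 1024          ! illustrative matrix size (assumption)
  integer, parameter :: sizeof_real = 4   ! bytes per single-precision element
  real(4), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(8) :: dev_a, dev_b, dev_c       ! device pointers (64-bit host assumed)
  integer :: stat
  ! Wrapper entry points from the legacy fortran.c (non-thunking build)
  integer, external :: cublas_init, cublas_shutdown
  integer, external :: cublas_alloc, cublas_free
  integer, external :: cublas_set_matrix, cublas_get_matrix
  external :: cublas_sgemm

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0

  stat = cublas_init()

  ! Allocate device buffers once; they can be reused across many SGEMM calls
  stat = cublas_alloc(n*n, sizeof_real, dev_a)
  stat = cublas_alloc(n*n, sizeof_real, dev_b)
  stat = cublas_alloc(n*n, sizeof_real, dev_c)

  ! Explicit host-to-device copies (this is where the transfer time goes)
  stat = cublas_set_matrix(n, n, sizeof_real, a, n, dev_a, n)
  stat = cublas_set_matrix(n, n, sizeof_real, b, n, dev_b, n)

  ! C = 1.0*A*B + 0.0*C, operating entirely on device memory
  call cublas_sgemm('N', 'N', n, n, n, 1.0, dev_a, n, dev_b, n, 0.0, dev_c, n)

  ! Copy only the result back to the host
  stat = cublas_get_matrix(n, n, sizeof_real, dev_c, n, c, n)
  if (stat /= 0) print *, 'cublas_get_matrix failed, status = ', stat

  stat = cublas_free(dev_a)
  stat = cublas_free(dev_b)
  stat = cublas_free(dev_c)
  stat = cublas_shutdown()
end program sgemm_nonthunking
```

With this layout, each matrix crosses the bus once per `cublas_set_matrix`/`cublas_get_matrix` call, whereas the thunking wrappers allocate, copy in, copy out, and free on every SGEMM call.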