Does anyone know whether cublasSetMatrix() and cublasGetMatrix() are clever enough to copy from/into page-locked host memory at high speed when the host buffer has already been allocated with cudaMallocHost()? Or do they just do “normal” memory copies?
The bandwidth test program shows I get a 4X speed-up using page-locked memory, so I’d like the BLAS routines to use this feature.
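One way to check this directly would be to time cublasSetMatrix() from a malloc() buffer and from a cudaMallocHost() buffer and compare. Below is a minimal sketch of that comparison, not a definitive benchmark. It assumes the cublas_v2.h header and a single GPU; the matrix size `n` and the helper `timeSetMatrix` are placeholders made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: time one cublasSetMatrix() call from `host` to `dev`
// for an n x n float matrix. Returns milliseconds, or -1 on a cuBLAS error.
static float timeSetMatrix(const float *host, float *dev, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasStatus_t st = cublasSetMatrix(n, n, sizeof(float), host, n, dev, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return st == CUBLAS_STATUS_SUCCESS ? ms : -1.0f;
}

int main() {
    const int n = 4096;  // placeholder size: 4096 x 4096 floats = 64 MB
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *pageable = (float *)malloc(bytes);  // ordinary host memory
    float *pinned = NULL;                      // page-locked host memory
    float *dev = NULL;                         // device memory
    if (pageable == NULL ||
        cudaMallocHost((void **)&pinned, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < (size_t)n * n; ++i) {
        pageable[i] = 1.0f;
        pinned[i] = 1.0f;
    }

    // Warm-up transfer so one-time initialization cost is not measured.
    timeSetMatrix(pinned, dev, n);

    float msPageable = timeSetMatrix(pageable, dev, n);
    float msPinned = timeSetMatrix(pinned, dev, n);
    if (msPageable <= 0.0f || msPinned <= 0.0f) {
        fprintf(stderr, "cublasSetMatrix failed or timing was too short\n");
        return 1;
    }

    // bytes / (ms * 1e6) converts to GB/s.
    printf("pageable: %.2f ms (%.2f GB/s)\n", msPageable, bytes / (msPageable * 1e6));
    printf("pinned:   %.2f ms (%.2f GB/s)\n", msPinned, bytes / (msPinned * 1e6));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

Building it would be something like `nvcc test.cu -lcublas` (the file name is arbitrary). If the pinned run shows roughly the same gain as the bandwidth test, the routines are benefiting from the page-locked buffer; the same approach should work for cublasGetMatrix() with the arguments reversed.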
Alex