Why is cublasGetMatrix slower than cublasSetMatrix?

A simple test, but with a strange result.
The data transfer rate of cublasSetMatrix is about 1.5 GB/s, which is reasonable.
But the rate of cublasGetMatrix is less than half of that: 0.67 GB/s.

It depends on the PCI-e transfer rate: with pageable host memory, writing to the card is faster than reading back from it.
Try pinned memory (allocate the matrix with cudaMallocHost, free it with cudaFreeHost); with page-locked memory the transfer rates in the two directions are more symmetric.
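
Here is a minimal sketch of how the comparison could be repeated with page-locked host memory. It is not the original poster's test: the matrix size (4096 x 4096 floats), the single warm-up transfer, and the helper gb_per_s are assumptions made for illustration. It uses cudaMallocHost / cudaFreeHost for the host buffer and CUDA events for timing; build with something like "nvcc test.cu -lcublas".

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: convert bytes and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gb_per_s(double bytes, float ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

int main(void) {
    const int n = 4096;  // assumed matrix dimension, chosen only for illustration
    const size_t bytes = (size_t)n * n * sizeof(float);
    float *h_a = NULL;   // page-locked (pinned) host buffer
    float *d_a = NULL;   // device buffer

    if (cudaMallocHost((void **)&h_a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_a, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < (size_t)n * n; ++i) h_a[i] = 1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_set = 0.0f, ms_get = 0.0f;

    // Warm-up transfer so one-time initialization cost is not included in the timing.
    cublasStatus_t st = cublasSetMatrix(n, n, sizeof(float), h_a, n, d_a, n);
    if (st != CUBLAS_STATUS_SUCCESS) { fprintf(stderr, "warm-up failed\n"); return 1; }

    // Host -> device
    cudaEventRecord(start, 0);
    st = cublasSetMatrix(n, n, sizeof(float), h_a, n, d_a, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_set, start, stop);
    if (st != CUBLAS_STATUS_SUCCESS) { fprintf(stderr, "cublasSetMatrix failed\n"); return 1; }

    // Device -> host
    cudaEventRecord(start, 0);
    st = cublasGetMatrix(n, n, sizeof(float), d_a, n, h_a, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_get, start, stop);
    if (st != CUBLAS_STATUS_SUCCESS) { fprintf(stderr, "cublasGetMatrix failed\n"); return 1; }

    printf("cublasSetMatrix: %.2f GB/s\n", gb_per_s((double)bytes, ms_set));
    printf("cublasGetMatrix: %.2f GB/s\n", gb_per_s((double)bytes, ms_get));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFreeHost(h_a);  // pinned memory must be released with cudaFreeHost, not free()
    return 0;
}
```

To see the asymmetry described above, swap cudaMallocHost / cudaFreeHost for plain malloc / free and run it again; the actual numbers will depend on the card, chipset, and driver.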