cublasGetMatrix alternatives

I’m coding a class to wrap an order-3 tensor, and I’m struggling to find a fast and convenient way to copy it to and from the device.

If this were an order-2 tensor (i.e., a matrix), I could use cuBLAS routines such as cublasGetMatrix / cublasSetMatrix. Does anyone have suggestions for low-latency (or latency-hiding) transfers of 3-dimensional data to and from the device?
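To make the question concrete, here is a rough sketch of one possible baseline: the CUDA runtime's cudaMalloc3D / cudaMemcpy3D, used the way cublasSetMatrix / cublasGetMatrix would be used for a matrix. It assumes a densely packed float tensor with x as the fastest-varying index; the helper name tensor_roundtrip_matches and the dimensions nx, ny, nz are hypothetical, and this is not necessarily the fastest option.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies a dense nx*ny*nz float tensor (x fastest) to a
// pitched device allocation and back, returning true if the data survives.
bool tensor_roundtrip_matches(size_t nx, size_t ny, size_t nz) {
    std::vector<float> src(nx * ny * nz), dst(nx * ny * nz, 0.0f);
    for (size_t i = 0; i < src.size(); ++i) src[i] = static_cast<float>(i);

    // For linear (non-array) memory the extent width is given in bytes.
    cudaExtent extent = make_cudaExtent(nx * sizeof(float), ny, nz);

    // The runtime chooses the row pitch of the device allocation.
    cudaPitchedPtr dev;
    if (cudaMalloc3D(&dev, extent) != cudaSuccess) return false;

    // Host -> device: the host side is tightly packed (pitch = row bytes).
    cudaMemcpy3DParms h2d = {};
    h2d.srcPtr = make_cudaPitchedPtr(src.data(), nx * sizeof(float), nx, ny);
    h2d.dstPtr = dev;
    h2d.extent = extent;
    h2d.kind   = cudaMemcpyHostToDevice;

    // Device -> host: same extent, roles swapped.
    cudaMemcpy3DParms d2h = {};
    d2h.srcPtr = dev;
    d2h.dstPtr = make_cudaPitchedPtr(dst.data(), nx * sizeof(float), nx, ny);
    d2h.extent = extent;
    d2h.kind   = cudaMemcpyDeviceToHost;

    bool ok = cudaMemcpy3D(&h2d) == cudaSuccess &&
              cudaMemcpy3D(&d2h) == cudaSuccess;
    cudaFree(dev.ptr);
    return ok && src == dst;
}
```

If the tensor is stored contiguously and no pitched layout is needed on the device, a single cudaMemcpy of nx*ny*nz elements would also work. For hiding latency, cudaMemcpy3DAsync with page-locked host memory (cudaMallocHost) and a non-default stream is the asynchronous counterpart of the calls above.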

Thanks,