I have implemented a Matrix library based on expression templates for both device and host operations. It supports a Matlab-like syntax, so that I can execute CPU-to-GPU assignments of the form

```
CudaMatrix<> testGPU(len1, len2);
Matrix<> testCPU(1, len3);
testGPU(Range(a1, a2, a3), Range(b1, b2, b3)) = testCPU;
```

which is equivalent to Matlab’s

```
testGPU(a1:a2:a3, b1:b2:b3) = testCPU;
```

At present, I have implemented the transfer in the following way:

```
CudaExpr<A,B>& operator=(const Matrix<B> &ob) {
    // Stage the host matrix in a temporary device matrix (host-to-device copy).
    CudaMatrix<B> temp(ob);
    // Kernel call: scatter temp into the strided submatrix locations.
    evaluation_submatrix_function(a_, temp.data_, GetNumElements());
    return *this;
}
```

In other words, I copy the content of the CPU `Matrix` to a temporary GPU object `temp`, and then assign `temp` to the submatrix `testGPU(a1:a2:a3,b1:b2:b3)` with a kernel call.

Is there any way to perform this memory transaction more efficiently? Does CUDA have commands to copy from the CPU to non-contiguous GPU locations and, in particular, from `a1` to `a3` in steps of `a2` along the rows and from `b1` to `b3` in steps of `b2` along the columns? Thank you very much in advance.