Efficiently performing memory transactions in Matlab-like CPU->GPU assignments

I have implemented a Matrix library based on expression templates for both device and host operations. It provides a Matlab-like syntax, so that I can execute CPU-to-GPU assignments of the form

CudaMatrix<>    testGPU(len1,len2);
Matrix<>        testCPU(1,len3);

testGPU(Range(a1,a2,a3),Range(b1,b2,b3))=testCPU;

which is equivalent to Matlab’s

testGPU(a1:a2:a3,b1:b2:b3)=testCPU;

At present, I have implemented the transfer in the following way

CudaExpr<A,B> operator=(const Matrix<B> &ob) {
    // stage the whole CPU matrix in a temporary GPU buffer
    CudaMatrix<B> temp(ob);
    // kernel call that writes temp into the strided destination submatrix
    evaluation_submatrix_function(a_, temp.data_, GetNumElements());
    return *this;
}

In other words, I copy the content of the CPU Matrix to the temporary GPU object temp and then assign temp to the submatrix testGPU(a1:a2:a3,b1:b2:b3) by a kernel call.
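
For reference, this is roughly the kind of scatter kernel I mean; a minimal sketch only, where the kernel name, signature, and the row-major layout with leading dimension ld are my assumptions rather than the actual library code:

// Hypothetical sketch: scatter the packed device buffer src into the strided
// submatrix of dst with rows a1, a1+a2, ... and columns b1, b1+b2, ...
// dst is assumed row-major with leading dimension ld (in elements).
template <class T>
__global__ void scatter_submatrix(T* dst, const T* src, int ld,
                                  int a1, int a2, int b1, int b2,
                                  int numRows, int numCols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numRows * numCols) return;
    int r = idx / numCols;   // row index within the submatrix
    int c = idx % numCols;   // column index within the submatrix
    dst[(a1 + r * a2) * ld + (b1 + c * b2)] = src[idx];
}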

Is there any way to perform this memory transaction more efficiently? Does CUDA have commands to copy from the CPU to non-contiguous GPU locations, in particular every a2 locations from a1 to a3 along the rows and every b2 locations from b1 to b3 along the columns? Thank you very much in advance.

For the copying of submatrices you would want to look at cudaMemcpy2D().
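
For example, copying a contiguous rows x cols sub-block of a host matrix into a device matrix could look roughly like this (a minimal sketch with hypothetical names, assuming row-major storage and leading dimensions given in elements):

#include <cuda_runtime.h>

// Sketch: copy a rows x cols sub-block of a row-major host matrix (leading
// dimension ldHost) into a row-major device matrix (leading dimension ldDev),
// starting at element (r0, c0) of the device matrix. All names are hypothetical.
cudaError_t copySubmatrixHostToDevice(float* dDst, size_t ldDev,
                                      const float* hSrc, size_t ldHost,
                                      size_t r0, size_t c0,
                                      size_t rows, size_t cols)
{
    return cudaMemcpy2D(dDst + r0 * ldDev + c0, ldDev * sizeof(float),   // dst, dst pitch in bytes
                        hSrc,                   ldHost * sizeof(float),  // src, src pitch in bytes
                        cols * sizeof(float),   rows,                    // width in bytes, height in rows
                        cudaMemcpyHostToDevice);
}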

Thank you very much, njuffa, for your suggestion. I have explained my issue in more detail at

http://stackoverflow.com/questions/16575588/efficiently-performing-memory-transactions-in-matlab-like-cpu-gpu-assignments

My issue is that I need a stride for the rows and a stride for the columns, but, from my understanding of both cudaMemcpy2D() and cudaMemcpy3D(), they have only one stride value. Am I right? Thanks.

I have never used cudaMemcpy3D() before. Looking at the documentation right now, I don’t think I understand how it works. As you say, there is only a single pitch value. I don’t see how this would allow copying of an arbitrary 3D sub-volume from inside a larger 3D volume, as this would require two pitch values in my thinking (i.e. plane pitch, row pitch). If that can actually be done, this should allow your desired copy to work (the spacing between rows in your 2D matrix would be equivalent to the plane pitch, and the spacing between elements in the rows of your 2D matrix would be equivalent to the row pitch).

Either my thinking is wrong, or cudaMemcpy3D() is restricted in what it can do and cannot copy an arbitrary 3D sub-volume from inside an enclosing 3D volume. Unfortunately, I don’t have time right now to play around with cudaMemcpy3D() to understand it better.

Thank you again for your help.

I must say that, reading the documentation, it is not very clear to me either how cudaMemcpy3D() works. I suspect that the rationale behind cudaMemcpy2D() and cudaMemcpy3D() is to extract (N-1)-dimensional patterns from N-dimensional arrays, for example rows from a 2D matrix and slices from a 3D matrix. But I’m not sure, and you certainly have much more experience than I do to check whether this is right in the 2D case.

Although I have a workaround, it would still be good for me to understand this approach, possibly implement it, and compare its performance against my current solution. If, in the future, you can recommend directions to explore or have any further suggestions, please let me know :-)

Again, thanks.

The purpose of cudaMemcpy2D() as designed is to be able to copy arbitrary 2D sub-matrices from a larger encompassing 2D matrix. These matrices could be images, for example, or floating-point matrices in linear algebra. In the early days of CUBLAS (of which I was the initial developer) I was looking for fast ways to support strided vectors and noticed that a 2D copy can be used nicely to perform a strided 1D copy. I very much doubt that I was the first person to notice that.
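
The trick is to make each "row" of the 2D copy a single element wide, so that the pitch arguments act as the vector strides. A minimal sketch (hypothetical names, strides given in elements):

#include <cuda_runtime.h>

// Sketch: copy n elements of a strided host vector (stride incSrc) to a
// strided device vector (stride incDst) with a single cudaMemcpy2D() call.
// Each "row" of the 2D copy is one element wide, so the pitches are the strides.
cudaError_t copyStridedVectorHostToDevice(float* dDst, size_t incDst,
                                          const float* hSrc, size_t incSrc,
                                          size_t n)
{
    return cudaMemcpy2D(dDst, incDst * sizeof(float),  // dst, dst pitch = dst stride in bytes
                        hSrc, incSrc * sizeof(float),  // src, src pitch = src stride in bytes
                        sizeof(float), n,              // width = one element, height = n
                        cudaMemcpyHostToDevice);
}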

For now you can use one cudaMemcpy2D() per strided row of your sub-matrix and iterate across these rows, meaning you will have as many calls to cudaMemcpy2D() as you have rows. You could also use CUBLAS’s cublasSetVector() and cublasGetVector() to copy each strided row.
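
Applied to the assignment in your question, that could look roughly like this (a sketch only, assuming 0-based indices, a row-major device matrix with leading dimension ldDev, and a packed host source; none of these details are specified in the thread):

#include <cuda_runtime.h>

// Sketch: emulate testGPU(a1:a2:a3, b1:b2:b3) = testCPU with one cudaMemcpy2D()
// per selected row. Within a row the destination elements are b2 apart, so that
// column stride becomes the destination pitch of a one-element-wide 2D copy.
void assignStridedSubmatrix(float* dDst, size_t ldDev, const float* hSrc,
                            int a1, int a2, int a3,   // row range a1:a2:a3
                            int b1, int b2, int b3)   // column range b1:b2:b3
{
    const size_t numCols = (size_t)((b3 - b1) / b2) + 1;  // selected columns per row
    size_t srcOffset = 0;                                 // packed host elements consumed so far
    for (int r = a1; r <= a3; r += a2) {
        cudaMemcpy2D(dDst + (size_t)r * ldDev + b1, (size_t)b2 * sizeof(float),
                     hSrc + srcOffset,              sizeof(float),
                     sizeof(float),                 numCols,
                     cudaMemcpyHostToDevice);
        srcOffset += numCols;
    }
}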

As I said, my experience with cudaMemcpy3D() is nil, as that API did not exist when I was actively working on linear algebra stuff. I may come back to the question in the future after familiarizing myself with cudaMemcpy3D(), if time allows.