newbie question: cudaMemcpy2DArrayToArray

I’ve run into a compiler error and I am stymied… (I think I might have more luck programming in Mayan… :) )

I’ve declared pointers:

 cudaMalloc ((void **)&Res, 2*NN * sizeof(Res[0]));

cublasAlloc (Ic*Lc, sizeof(float), (void**)&gg);

and then, after hopefully filling Res with data, I am trying to copy Res to gg:

cudaMemcpy2DArrayToArray(gg,0,0,Res,0,0,ND,NK, cudaMemcpyDeviceToDevice);

I get two compiler errors from this line:

"… argument of type “float *” is incompatible with parameter of type “const cudaArray *” "

However, just above it I have a call with similarly declared variables, e.g.:

cudaMemcpy (Arg, VAR, NN * sizeof(VAR[0]), cudaMemcpyHostToDevice);

which the compiler admires.

Can someone please suggest the incantation to fix the cudaMemcpy2DArrayToArray call? I suspect this may be a simple C question; I am a Fortran man in a C world, sigh…

Thx so much,

cudaMemcpy2DArrayToArray((cudaArray *) gg,0,0,(cudaArray *) Res,0,0,ND,NK,cudaMemcpyDeviceToDevice);

seems to pass the compiler muster. I am a ways from running this code to test that it runs correctly…

I’m putting the finishing touches on this particular program, which is working fine at the moment. It’s a MATLAB MEX file that is giving me about a 4X speedup compared to a quad-core AMD Phenom 9600. (whoohoo).

At one point in the code, I am doing:

cublasGetMatrix (mk, 2*mo, sizeof(float),Res, mk, res, mk);   

cublasSetMatrix (L, I, sizeof(float), res, L, (void*)g, Lc);  

This copies the mk by 2*mo array “Res” off the Device to the host, and then copies that array back to the Device into a padded array “g”, sized Lc by Ic. (L*I = mk*2*mo.)

It would be far faster to do something like:



To take advantage of the faster bandwidth of the inter-Device transfers, and only do one transfer.

I’ll be damned if I can get it to work though; what am I missing? - suggestions please?!?


Wait, I think you’ve got the wrong memcpy function. They’re confusingly named though.

cudaArray is a special struct type allocated in Funny Memory, that you bind textures to for instance. Not your regular memory as far as I’ve seen.

What are gg and Res? If they’re just good old float arrays allocated with cudaMalloc, you want cudaMemcpy2D instead.


Check sections D.5.11 vs. D.5.15 in the manual, and search the sample projects for the different usage of cudaMemcpy2D vs. cudaMemcpy2DArrayToArray.

Sorry if I’m getting this wrong, hope this fixes the problem.

I eventually figured out that I had the wrong memcpy by staring at the documentation long enough (there ought to be a law forbidding C-coders from writing documentation!). I’ve tried this now:

cudaMemcpy2D(gg, Lc*sizeof(float), Res, 2*mo*sizeof(float), L*sizeof(float), I*sizeof(float), cudaMemcpyDeviceToDevice);

And other incarnations (e.g., (void*) in front of the arrays), but still no success. And, yes, gg and Res are ordinary float arrays. I’ll keep poking at it; I am sure the right voodoo will occur to me eventually.

Res is allocated:


cudaMalloc ((void **)&Res, 2*NN * sizeof(Res[0]));

gg is allocated:

cublasAlloc (Ic*Lc, sizeof(float), (void**)&gg);

(Ic is I padded to be a multiple of 32, etc.)

I’ve tried everything I can think of in configurations of cudaMemcpy2D and no luck anywhere. I am baffled and befuddled on how to get this to work. (At this point I suspect it doesn’t work in this case.)

I’ll bug this forum one more time (which is probably once too many) before I give up. I’ve gotten the copy to sort of work with the line:

cudaMemcpy2D(gg, Ic*sizeof(float), Res, I*sizeof(float), I*sizeof(float), L, cudaMemcpyDeviceToDevice);

It only sort of works, however. The values are copied, but the desired zero padding does not occur.

The problem boils down to this:

I have a float* vector on the device (1 2 3 4 5 6 7 8 9)’ that I want to copy into a 5X4 float* array on the device like this:

1 4 7 0
2 5 8 0
3 6 9 0
0 0 0 0
0 0 0 0

(or the equivalent linear representation) Can someone suggest a way to do that? The cublas Get/Set routines seem to do this for me fine, but perhaps they are using CUDA arrays behind the scenes. They also require two copies host<->device, whereas it would be nice to do one copy device<->device.
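For reference, the “equivalent linear representation” can be sketched on the host in plain C. This is only an illustration under stated assumptions: pad_colmajor_ref is a made-up helper name (not a CUDA or cublas call), and it assumes column-major storage as cublas uses, where element (r, c) of a matrix with leading dimension ld sits at linear index r + c*ld.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not a CUDA or cublas call).
   Writes the rows x cols column-major matrix `src` into the top-left corner
   of the ld_dst x cols_dst column-major buffer `dst`, zero-filling the rest. */
void pad_colmajor_ref(float *dst, size_t ld_dst, size_t cols_dst,
                      const float *src, size_t rows, size_t cols)
{
    assert(rows <= ld_dst && cols <= cols_dst);
    for (size_t c = 0; c < cols_dst; ++c) {
        for (size_t r = 0; r < ld_dst; ++r) {
            /* Inside the source block: copy the element; outside: padding. */
            int inside = (r < rows && c < cols);
            dst[r + c * ld_dst] = inside ? src[r + c * rows] : 0.0f;
        }
    }
}
```

With the 9-element vector above (rows = cols = 3, ld_dst = 5, cols_dst = 4), the buffer comes out as 1 2 3 0 0, 4 5 6 0 0, 7 8 9 0 0, 0 0 0 0 0, which is the 5X4 picture read column by column.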

The cudaMemcpy2D call above is giving me:

1 6 0 0
2 7 0 0
3 8 0 0
4 9 0 0
5 0 0 0

which does not work for me. (I am using the CUDA BLAS, hence padding arrays so that they are sized in multiples of 32.)

Thanks for your patience…(at the moment this is not a critical problem, so much as a learning adventure)

If at first you don’t succeed…

And the answer is…(drum roll please)…

cudaMemcpy2D(gg, Lc*sizeof(float), Res, L*sizeof(float), L*sizeof(float), I, cudaMemcpyDeviceToDevice);

What I thought were rows were actually columns for Memcpy.

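I was aware that array ordering (row-major vs. column-major) differs between conventions, but didn’t quite know what that meant until now. cublas stores matrices column-major (Fortran-style), while for Memcpy the width is the contiguous run in bytes and the height is the number of such runs, so each cublas column is a “row” to Memcpy. (I think the documentation says that.)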
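To see why the pitches come out that way, here is a host-side sketch in plain C of the copy pattern, as I understand it from the documentation: height runs of width bytes, with consecutive runs starting spitch bytes apart in the source and dpitch bytes apart in the destination. Both function names are made up for illustration, and memcpy on ordinary host memory stands in for the device copy. The sketch also assumes the destination is zeroed separately, since the copy never touches the padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host stand-in for a pitched 2D copy (not a CUDA API).
   All sizes except `height` are in bytes. */
void copy2d_host(void *dst, size_t dpitch, const void *src, size_t spitch,
                 size_t width, size_t height)
{
    assert(width <= dpitch && width <= spitch);
    for (size_t run = 0; run < height; ++run) {
        memcpy((char *)dst + run * dpitch,
               (const char *)src + run * spitch, width);
    }
}

/* Pads an L x I column-major matrix into an Lc x Ic column-major buffer.
   Each column is one contiguous run, so width is L floats and height is I. */
void pad_with_copy2d(float *gg, size_t Lc, size_t Ic,
                     const float *Res, size_t L, size_t I)
{
    assert(L <= Lc && I <= Ic);
    memset(gg, 0, Lc * Ic * sizeof(float));   /* zero the padding first */
    copy2d_host(gg, Lc * sizeof(float), Res, L * sizeof(float),
                L * sizeof(float), I);
}
```

The argument order in pad_with_copy2d mirrors the working call above: destination pitch Lc, source pitch L, width L (all times sizeof(float)), and height I.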

The effect for me is a 2% speed up in my calculations (and an important lesson…)

Never give up…there is always one last thing to try… :)

Good that it worked out for you, but one comment, here:


Using 0.0 is a very bad idea, it will only cause someone to try 1.0 instead and find out it does not work at all. memset sets every byte to the same value, so that parameter should be a byte value (i.e. 0 instead of 0.0).

In addition this also means that you can not memset float arrays to much else than 0.0, and it also means that some people consider it bad style because it assumes a certain format of the float numbers (though I do not know of a float format in significant use where all-zeros does not represent the number 0.0).
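A small host-side check in plain C makes the byte-wise behaviour concrete. The standard memset stands in for cudaMemset here (both fill bytes with a single value), the helper names are made up for illustration, and the sketch assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: bit pattern a float holds after
   memset(&f, byte_value, sizeof f). Assumes a 32-bit float. */
uint32_t float_bits_after_memset(int byte_value)
{
    float f;
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memset(&f, byte_value, sizeof f);   /* every byte gets the same value */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: bit pattern of an ordinary float value. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof value == sizeof bits);
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

A byte value of 0 gives the same bits as 0.0f, but a byte value of 1 gives 0x01010101, which is nowhere near the bit pattern of 1.0f. Passing 1.0 where an int is expected just converts it to 1 first, so the result is the same tiny number rather than 1.0.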

It would be cleaner and probably not much slower to write a kernel to set the whole array to some value, I guess cuda does not yet have something like that?

Staring at the entry for cudaMemset in the reference manual, I see you are right about that. It sets the value byte-wise. (Seems a rather odd utility; but I am sure the CUDA people know what they are doing.) In other places I was using 0.0f as the value, which byte-wise makes no sense. So, yes, there ought to be a standard utility for setting floating-point arrays to a value (zeros, ones, whatever); easy enough to implement.
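As a rough idea of what such a utility would do, here is a plain C sketch that runs on the host. fill_float is a made-up name, not an existing CUDA routine; on the device the same thing would presumably be a small kernel, with the loop body being the work of one thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: set every element of a float array to `value`.
   Unlike a byte-wise memset, this works for any value, not just zero. */
void fill_float(float *data, size_t n, float value)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        data[i] = value;   /* element-wise store, one float at a time */
    }
}
```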

Thanks for pointing that out!