newbie question: cudaMemcpy2DArrayToArray

I’ve run into a compiler error and I am stymied… (I think I might have more luck programming in Mayan… :) )

I’ve declared pointers:

 cudaMalloc ((void **)&Res, 2*NN * sizeof(Res[0]));

and
cublasAlloc (Ic*Lc, sizeof(float), (void**)&gg);
cudaMemset (gg, 0.0, Ic*Lc*4);

and then, after hopefully filling Res with data, I am trying to copy Res to gg:

cudaMemcpy2DArrayToArray(gg,0,0,Res,0,0,ND,NK, cudaMemcpyDeviceToDevice);

I get two compiler errors from this line (one per array argument), both of the form:

"… argument of type “float *” is incompatible with parameter of type “const cudaArray *” "

However, just above I have with similarly declared variables, e.g,:

cudaMemcpy (Arg, VAR, NN * sizeof(VAR[0]), cudaMemcpyHostToDevice);

which the compiler admires.

Can someone please suggest the incantation to fix the cudaMemcpy2DArrayToArray call? I suspect this may be a simple C question; I am a Fortran man in a C world, sigh…

Thx so much,
B.-C.

cudaMemcpy2DArrayToArray((cudaArray *) gg,0,0,(cudaArray *) Res,0,0,ND,NK,cudaMemcpyDeviceToDevice);

seems to pass the compiler muster. I am a ways from running this code to test that it runs correctly…

I’m putting the finishing touches on this particular program, which is working fine at the moment. It’s a MATLAB MEX file that is giving me about a 4X speed-up compared to a quad-core AMD Phenom 9600. (Whoohoo.)

At one point in the code, I am doing:

cublasGetMatrix (mk, 2*mo, sizeof(float),Res, mk, res, mk);   

cublasSetMatrix (L, I, sizeof(float), res, L, (void*)g, Lc);  

Which copies the mk by 2*mo sized array “Res” off of the Device, and then copies that array back to the Device to a padded array “g”, sized Lc by Ic. (L*I = mk*2*mo).

It would be far faster to do something like:

cudaMemcpy2DArrayToArray((cudaArray*)g, 1, 1, (cudaArray*)Res, 1, 1,
                         L, I*sizeof(float), cudaMemcpyDeviceToDevice);

To take advantage of the faster bandwidth of the inter-Device transfers, and only do one transfer.

I’ll be damned if I can get it to work though; what am I missing? - suggestions please?!?

Thx!

Wait, I think you’ve got the wrong memcpy function. They’re confusingly named though.

cudaArray is a special struct type allocated in Funny Memory, that you bind textures to for instance. Not your regular memory as far as I’ve seen.

What are gg and Res? If they’re just good old float arrays allocated with cudaMalloc, you want this function:

cudaMemcpy2D()

Check sections D.5.11 vs. D.5.15 in the manual, and search the sample projects for the different usage of cudaMemcpy2D vs. cudaMemcpy2DArrayToArray.

Sorry if I’m getting this wrong, hope this fixes the problem.

I eventually figured out that I had the wrong memcpy…staring at the documentation long enough (there ought to be a law forbidding C-coders from writing documentation!). I’ve tried this now:

cudaMemcpy2D(gg, Lc*sizeof(float), Res, 2*mo*sizeof(float),
             L*sizeof(float), I*sizeof(float), cudaMemcpyDeviceToDevice);

And other incarnations (e.g., (void*) in front of the arrays), but still no success. And, yes, gg and Res are ordinary float arrays. I’ll keep poking at it; I am sure the right voodoo will occur to me eventually.

Res is allocated:

NN=mk*mo;

cudaMalloc ((void **)&Res, 2*NN * sizeof(Res[0]));

gg is allocated:

cublasAlloc (Ic*Lc, sizeof(float), (void**)&gg);

(Ic is I padded to be a multiple of 32, etc.)

I’ve tried everything I can think of in configurations of cudaMemcpy2D and no luck anywhere. I am baffled and befuddled on how to get this to work. (At this point I suspect it doesn’t work in this case.)

I’ll bug this forum one more time (which is probably once too many) before I give up. I’ve gotten the copy to sort of work with the line:

cudaMemcpy2D(gg, Ic*sizeof(float), Res, I*sizeof(float),
             I*sizeof(float), L, cudaMemcpyDeviceToDevice);

It only sort of works, however. The values are copied, but the desired zero padding does not occur.

The problem boils down to this:

I have a float* vector on the device (1 2 3 4 5 6 7 8 9)’ that I want to copy into a 5x4 float* array on the device like this:

1 4 7 0
2 5 8 0
3 6 9 0
0 0 0 0
0 0 0 0

(or the equivalent linear representation) Can someone suggest a way to do that? The cublas Get/Set routines seem to do this for me fine, but perhaps they are using CUDA arrays behind the scenes. They also require two copies host<->device, whereas it would be nice to do one copy device<->device.

The cudaMemcpy call above is giving me:

1 6 0 0
2 7 0 0
3 8 0 0
4 9 0 0
5 0 0 0

which does not work for me. (I am using the CUDA BLAS, hence padding arrays so that they are sized in multiples of 32.)

Thanks for your patience… (at the moment this is not a critical problem so much as a learning adventure)

If at first you don’t succeed…

And the answer is…(drum roll please)…

cudaMemcpy2D(gg, Lc*sizeof(float), Res, L*sizeof(float),
             L*sizeof(float), I, cudaMemcpyDeviceToDevice);

What I thought were rows were actually columns for Memcpy.

I was aware of differing array ordering vs. rows/columns, but didn’t quite know what that meant until now. cublas has one convention, Memcpy has the other! (I think the documentation says that.)

The effect for me is a 2% speed up in my calculations (and an important lesson…)

Never give up…there is always one last thing to try… :)

Good that it worked out for you, but one comment, here:

cudaMemset(gg, 0.0, Ic*Lc*4);

Using 0.0 is a very bad idea: it will only lead someone to try 1.0 instead and find that it does not work at all. memset sets every byte to the same value, so that parameter should be a byte value (i.e. 0 instead of 0.0).

This also means that you cannot memset float arrays to much other than 0.0, and some people consider even that bad style because it assumes a certain format of the float numbers (though I do not know of a float format in significant use where all-zero bytes do not represent the number 0.0).

It would be cleaner, and probably not much slower, to write a kernel that sets the whole array to some value; I guess CUDA does not yet have something like that built in?

Staring at the entry for cudaMemset in the reference manual, I see you are right about that: it is a byte-wise setting of this value. (Seems a rather odd utility, but I am sure the CUDA people know what they are doing.) In other places I was using 0.0f as the value, which byte-wise makes no sense. So, yes, there ought to be a standard utility for setting values in floating-point arrays (zeros, ones, whatever); easy enough to implement.

Thanks for pointing that out!