CUDA array host to device

Hi folks,

I need to do the following things:

1. Allocate 3 char arrays on the device, let’s say a[m], b[n][2], c[l][2].
2. On the device side, use a and b as inputs to a kernel that computes the values of c.
3. Copy c back to the host.
4. Free the memory for a, b, and c on the device.

The values of a, b, and c change on each iteration of the loop.
Could you please show me the code to do this?

It is not exactly the answer you are looking for, but if you can code your request in C as a function of a, b, and c, you are 75% of the way to coding it in CUDA. Just check out the first few chapters of the CUDA Programming Guide, which show you how to do the memory allocation/deallocation, the syntax for invoking a kernel, and how to call the compiler. CUDA also comes with several examples, some small and some complex. Once your code is working, take a look at the performance suggestions discussed in the guide.

You shouldn’t allocate/deallocate a, b, and c every time in your loop. Just allocate them at the beginning, and deallocate them at the end of your program.

But in my case, a, b, and c change every time, and their current values depend on the previous calculation.

Are the values of c independent and are the values of a and b constant?

If so could you do the following?


  <CPU> 1. allocate 3 char arrays in the device, let's say a[m], b[n][2], c[l][2].

  <CPU> 2. Execute GPU kernel f(a,b,c)

  <GPU>     a. On the device side, for each thread i:

  <GPU>         I. c[i][...] = f(a, b)

  <GPU> 2b. GPU kernel complete

  <CPU> 3. Copy c back to the host.

  <CPU> 4. Free the memories of a b c in the device.
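
The four steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the kernel name compute_c, its body, and the helper run_once are made up for illustration, since the real f(a, b) is not given in this thread. All three arrays are treated as flat char buffers, so b[i][j] lives at b[i*2+j].

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: one thread per row of c.
// The real f(a, b) is unknown; this body only shows the indexing.
__global__ void compute_c(const char *a, int m,
                          const char *b, int n,
                          char *c, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;                 // guard the partially filled last block
    c[i * 2 + 0] = a[i % m];            // c[i][0]
    c[i * 2 + 1] = b[(i % n) * 2 + 1];  // c[i][1], taken from b[i % n][1]
}

// Runs steps 1-4 once. Assumes m, n, l >= 1. Returns 0 on success.
int run_once(const char *h_a, int m, const char *h_b, int n, char *h_c, int l)
{
    char *d_a = NULL, *d_b = NULL, *d_c = NULL;

    // 1. Allocate the three arrays on the device (sizes in bytes).
    cudaError_t err = cudaMalloc((void **)&d_a, m);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_b, n * 2);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_c, l * 2);

    // Upload the inputs.
    if (err == cudaSuccess) err = cudaMemcpy(d_a, h_a, m, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, h_b, n * 2, cudaMemcpyHostToDevice);

    // 2. Launch the kernel with enough blocks to cover l threads.
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (l + threads - 1) / threads;
        compute_c<<<blocks, threads>>>(d_a, m, d_b, n, d_c, l);
        err = cudaGetLastError();       // reports launch failures
    }

    // 3. Copy c back; this cudaMemcpy waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(h_c, d_c, l * 2, cudaMemcpyDeviceToHost);

    // 4. Free device memory (cudaFree on a NULL pointer is a no-op).
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main(void)
{
    const char a[3] = {'x', 'y', 'z'};
    const char b[2][2] = {{'a', 'b'}, {'c', 'd'}};
    char c[4][2];

    // A host char[n][2] is contiguous, so &b[0][0] can be passed as a flat buffer.
    if (run_once(a, 3, &b[0][0], 2, &c[0][0], 4) != 0) return 1;
    for (int i = 0; i < 4; ++i) printf("%c%c\n", c[i][0], c[i][1]);
    return 0;
}
```

Something like nvcc example.cu -o example should build it. In a real loop, the cudaMalloc and cudaFree calls would move outside the loop, as suggested earlier in the thread.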


Yes, that’s exactly what I am doing. I already have the structure for the code, but I am confused by the pointers and the data structures in C. I know I should get more familiar with C before I start on CUDA, but this is quite urgent, so I am posting here to ask for help.

If I am using a 2D array, am I supposed to allocate memory using cudaMallocPitch()?

Could anyone explain to me what the pitch is?
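
For context on the question above: cudaMallocPitch() may pad each row of a 2D allocation so that rows start at aligned addresses, and the pitch it returns is the width of one padded row in bytes. The sketch below shows how the pitch is used for indexing and copying. The function names (alloc_pitched, fill_rows, download) and the int element type are assumptions chosen for illustration, not anything from this thread. For rows as narrow as b[n][2], a flat 1D allocation is probably simpler.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Allocate a width x height int array with padded rows.
// *pitch receives the padded row size in BYTES, which may exceed width * sizeof(int).
cudaError_t alloc_pitched(int **d_ptr, size_t *pitch, int width, int height)
{
    return cudaMallocPitch((void **)d_ptr, pitch, width * sizeof(int), height);
}

// Each thread writes one element. Launch with a 2D grid covering width x height.
__global__ void fill_rows(int *base, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    // Step to the start of the row in bytes, then index by element.
    int *row_ptr = (int *)((char *)base + row * pitch);
    row_ptr[col] = row * width + col;
}

// Copy the pitched device array into a tightly packed host array.
cudaError_t download(int *h_dst, const int *d_src, size_t pitch, int width, int height)
{
    size_t row_bytes = width * sizeof(int);  // unpadded row size on the host
    return cudaMemcpy2D(h_dst, row_bytes, d_src, pitch,
                        row_bytes, height, cudaMemcpyDeviceToHost);
}
```

The key point is that the row stride is the pitch, not width * sizeof(int), so the pitch has to be passed to every kernel and copy that touches the array.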


Does the array size depend on the last calculation?

I’d still take Wumpus’s advice and allocate them once in the beginning, just allocate a maximum size if you know it.

Allocation != value assignment.

And if you don’t know the maximum size, you could allocate some buffer in the beginning, and reallocate if it turns out that the size is larger than the previous size (best is to choose the next higher power of two and allocate that, so you don’t keep reallocating if it grows a bit every step).
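
A rough sketch of that grow-only reallocation idea, assuming a small helper struct. DeviceBuffer, reserve, and next_pow2 are hypothetical names for illustration, not part of the CUDA API. Note that this version does not preserve the old contents when it grows, which is fine if the inputs are re-uploaded on every iteration anyway.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Smallest power of two that is >= n (for n >= 1).
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Hypothetical helper: a device buffer that only ever grows.
struct DeviceBuffer {
    char  *ptr;       // device pointer, NULL until first use
    size_t capacity;  // bytes currently allocated
};

// Make sure buf holds at least `needed` bytes.
// Old contents are NOT preserved when the buffer grows.
static cudaError_t reserve(DeviceBuffer *buf, size_t needed)
{
    if (needed <= buf->capacity) return cudaSuccess;  // common case: reuse as is

    size_t new_cap = next_pow2(needed);
    char *new_ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&new_ptr, new_cap);
    if (err != cudaSuccess) return err;               // old buffer is still valid

    cudaFree(buf->ptr);                               // no-op if ptr is NULL
    buf->ptr = new_ptr;
    buf->capacity = new_cap;
    return cudaSuccess;
}
```

Start with DeviceBuffer buf = {NULL, 0}; and call reserve(&buf, bytes_needed) at the top of each iteration. With sizes creeping up to around 1500 bytes, this would reallocate only about a dozen times in total instead of once per step.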

I almost always allocate multidimensional arrays as a single dimension array and do the appropriate indexing, e.g.

int a[N][M]

can be written as

int ap[N*M]

and indexed

ap[i*M+j] where i is in [0,N) and j is in [0,M)
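
Applied to the b[n][2] and c[l][2] arrays from this thread, the same flattening might look like the sketch below. The helper flat_index and the kernel swap_columns are made-up names, and the kernel body is only a stand-in to show the indexing.

```cuda
#include <cuda_runtime.h>

// Row-major offset of element (i, j) in an array with `cols` columns.
__host__ __device__ inline int flat_index(int i, int j, int cols)
{
    return i * cols + j;
}

// Illustrative kernel: writes the two columns of b[n][2] into c[n][2] swapped,
// one thread per row. Both arrays are flat char buffers of n * 2 bytes.
__global__ void swap_columns(const char *b, char *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    c[flat_index(i, 0, 2)] = b[flat_index(i, 1, 2)];  // c[i][0] = b[i][1]
    c[flat_index(i, 1, 2)] = b[flat_index(i, 0, 2)];  // c[i][1] = b[i][0]
}
```

A host array declared as char b[n][2] is already contiguous in row-major order, so it can be copied to the device with a single cudaMemcpy of n * 2 bytes and indexed this way in the kernel.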

The array size differs a lot from one loop iteration to the next; the maximum could be more than a thousand, and the minimum could be just one. I cannot predict how often the arrays will be long; it depends on the actual case. Do you still think it is a good idea to allocate that much space for the arrays?

So, max{m,n,l} = 1000? So, if you were to collapse the second dimension, you would still only have 2000 elements. You know the types used, so you can estimate the memory used. Shoot, if you have the space, double it for safety. If you have the space on the card, do what wumpus says and save the overhead of allocation/deallocation. When shuffling memory, you only need to fill a and b, and copy back from c only what is needed. Copying the entire oversized array would not be desirable.

(Sorry, misread your message. I see max{m,n,l} > 1000. Regardless, since you know the type, you can estimate the memory needed. You could even preallocate up to the maximum allowed on the card, and bail out or split up a problem that is too large, but still only copy what needs to go to/from the card.)
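
Here is one way that could look in code: allocate for the worst case once, then move only the bytes in use on each iteration. MAX_LEN, the DeviceArrays struct, and the function names are assumptions made for this sketch; pick a bound that fits your real data.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

enum { MAX_LEN = 2048 };  // assumed upper bound on m, n and l; not from the thread

// Hypothetical holder for the three device buffers.
struct DeviceArrays { char *a, *b, *c; };

// Call once before the loop: worst-case allocation for a[m], b[n][2], c[l][2].
cudaError_t alloc_once(DeviceArrays *d)
{
    d->a = d->b = d->c = NULL;
    cudaError_t err = cudaMalloc((void **)&d->a, MAX_LEN);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d->b, MAX_LEN * 2);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d->c, MAX_LEN * 2);
    return err;
}

// Each iteration: upload only the m and n * 2 bytes actually in use.
cudaError_t upload_inputs(const DeviceArrays *d,
                          const char *h_a, int m, const char *h_b, int n)
{
    cudaError_t err = cudaMemcpy(d->a, h_a, (size_t)m, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d->b, h_b, (size_t)n * 2, cudaMemcpyHostToDevice);
    return err;
}

// Each iteration: download only the l * 2 bytes of c that were computed.
cudaError_t download_result(const DeviceArrays *d, char *h_c, int l)
{
    return cudaMemcpy(h_c, d->c, (size_t)l * 2, cudaMemcpyDeviceToHost);
}

// Call once after the loop.
void free_once(DeviceArrays *d)
{
    cudaFree(d->a);
    cudaFree(d->b);
    cudaFree(d->c);
}
```

With this bound the three buffers total about 10 KB, so the cost of over-allocating is negligible next to calling cudaMalloc and cudaFree on every iteration.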

Thanks. I will take all your advice and allocate memory before the loop.

And how about the blocks? On the device, each thread will calculate one element of c. The number of threads I need depends on the length of the array. Sometimes it is small, sometimes large. It is like 1, 2, 3, …, 1498, 1499, 1500, 1500, 1500, …, 1500, 1499, 1498, …, 3, 2, 1. Each number represents the number of threads in one loop iteration.

I know the maximum is 512 threads per block. Should I fix the number of blocks ahead of time, or should I set the number of blocks in each loop iteration? I am not sure how much overhead setting it each time would add.
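
For reference, the grid size is just a launch parameter, so recomputing it per iteration is one integer division on the host. A minimal sketch, assuming a fixed block size of 256 threads and a placeholder kernel (blocks_for, compute_c, and launch_for_size are illustrative names, and the kernel body is not the real computation):

```cuda
#include <cuda_runtime.h>

// Blocks needed so that blocks * threads_per_block >= count (ceiling division).
static int blocks_for(int count, int threads_per_block)
{
    return (count + threads_per_block - 1) / threads_per_block;
}

// Placeholder kernel: one thread per row of c.
// Assumes a has at least `count` elements and b at least `count` rows.
__global__ void compute_c(const char *a, const char *b, char *c, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;        // threads past the end of the data do nothing
    c[i * 2 + 0] = a[i];
    c[i * 2 + 1] = b[i * 2];
}

// Called once per loop iteration with the current size (count >= 1).
void launch_for_size(const char *d_a, const char *d_b, char *d_c, int count)
{
    const int threads = 256;       // assumption: fixed block size under the 512 limit
    compute_c<<<blocks_for(count, threads), threads>>>(d_a, d_b, d_c, count);
}
```

For count = 1500 this launches 6 blocks of 256 threads, and the bounds check in the kernel makes the extra 36 threads return immediately.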