CUDA array host to device

casybaby · February 19, 2008, 3:47pm

Hi folks,

I need to do the following:

#loopBegin
1. Allocate 3 char arrays on the device, let's say a[m], b[n][2], c[l][2].
2. On the device side, use a and b as inputs to a kernel that computes the values of c.
3. Copy c back to the host.
4. Free the memory for a, b, and c on the device.
#loopEnd

The values of a, b, and c change on each pass through the loop.
Could you please show me the code to do this?
Much appreciated.

wildcat4096 · February 19, 2008, 4:07pm

It is not exactly the answer you are looking for, but if you can code your request in C as a function of a, b, and c, you are 75% of the way to coding it in CUDA. Just check out the first few chapters of the CUDA Programming Guide, which show you how to do the memory allocation/deallocation, the syntax for invoking a kernel, and how to call the compiler. CUDA also comes with several examples, some small and some complex. After getting your code working, take a look at the performance suggestions discussed in the guide.

wumpus · February 19, 2008, 4:16pm

You shouldn't allocate/deallocate a, b, c every time in your loop. Just allocate them at the beginning, and deallocate them at the end of your program.

casybaby · February 19, 2008, 4:28pm

But in my case, a, b, and c change every time, and their current values depend on the previous calculation.

wildcat4096 · February 19, 2008, 4:56pm

Are the values of c independent and are the values of a and b constant?

If so could you do the following?

#CPUloopBegin

  <CPU> 1. allocate 3 char arrays in the device, let's say a[m], b[n][2], c[l][2].

  <CPU> 2. Execute GPU kernel f(a,b,c)

  <GPU>     a. In the device side, for each thread i:

  <GPU>         I. c[i][...] = f(a, b )

  <GPU> 2b. GPU kernel complete

  <CPU> 3. Copy c back to the host.

  <CPU> 4. Free the memories of a b c in the device.

#CPUloopEnd
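A minimal sketch of one pass through the outline above, taken literally (allocate, launch, copy back, free). The kernel body `f_kernel`, the helper names `check` and `run_once`, and the block size of 256 are placeholders I made up for illustration, since the thread never says what f computes. The 2D arrays are stored flat, two chars per row.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort on any CUDA runtime error (hypothetical helper).
static void check(cudaError_t e)
{
    if (e != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for f(a, b): thread i fills row i of c.
// b is n x 2 and c is l x 2, both stored flat in row-major order.
__global__ void f_kernel(const char* a, int m, const char* b, int n,
                         char* c, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;               // the last block may be partly idle
    c[2 * i + 0] = a[i % m];
    c[2 * i + 1] = b[2 * (i % n)];
}

// One loop iteration, steps 1 to 4 of the outline. Assumes m, n, l >= 1.
std::vector<char> run_once(const std::vector<char>& a,  // m elements
                           const std::vector<char>& b,  // 2 * n elements
                           int l)
{
    int m = (int)a.size();
    int n = (int)b.size() / 2;
    std::vector<char> c(2 * (size_t)l);
    char *d_a = NULL, *d_b = NULL, *d_c = NULL;

    // 1. Allocate on the device and upload the inputs.
    check(cudaMalloc((void**)&d_a, a.size()));
    check(cudaMalloc((void**)&d_b, b.size()));
    check(cudaMalloc((void**)&d_c, c.size()));
    check(cudaMemcpy(d_a, a.data(), a.size(), cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_b, b.data(), b.size(), cudaMemcpyHostToDevice));

    // 2. Launch one thread per row of c.
    int threads = 256;
    int blocks = (l + threads - 1) / threads;
    f_kernel<<<blocks, threads>>>(d_a, m, d_b, n, d_c, l);
    check(cudaGetLastError());

    // 3. Copy c back; this cudaMemcpy waits for the kernel to finish.
    check(cudaMemcpy(c.data(), d_c, c.size(), cudaMemcpyDeviceToHost));

    // 4. Free the device memory.
    check(cudaFree(d_a));
    check(cudaFree(d_b));
    check(cudaFree(d_c));
    return c;
}
```

Calling `run_once(a, b, l)` inside the host loop reproduces the outline step by step, at the cost of three allocations and three frees per iteration.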

casybaby · February 19, 2008, 5:04pm

Yes, that's exactly what I am doing. I already have the structure for the code. But I am just so confused by the pointers and the data structs in C. I know I need to get more familiar with C before I start CUDA. But it is quite urgent, so I am posting here to ask for help.

If I am using a 2D array, am I supposed to allocate memory using cudaMallocPitch()?

Could anyone explain to me what that pitch is?

Thx!!
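On the pitch question: the pitch is the number of bytes between the start of one row and the start of the next. cudaMallocPitch() may pad each row so that rows start at well-aligned addresses, and it reports the padded row size back as the pitch. A minimal sketch, assuming a made-up kernel `fill_rows` and helper `pitched_demo`, neither of which comes from the thread:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one row. Rows are `pitch` bytes apart in memory,
// which can be more than the `width` bytes that are actually used.
__global__ void fill_rows(char* base, size_t pitch, int width, int height)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= height) return;
    char* row = base + r * pitch;     // step by pitch, not by width
    for (int col = 0; col < width; ++col)
        row[col] = (char)(r + col);
}

// Allocates a pitched height x width char array, fills it on the device,
// and copies it back into a tightly packed host array.
std::vector<char> pitched_demo(int width, int height, size_t* pitch_out)
{
    char* d = NULL;
    size_t pitch = 0;
    cudaMallocPitch((void**)&d, &pitch, width * sizeof(char), height);

    fill_rows<<<(height + 127) / 128, 128>>>(d, pitch, width, height);

    // cudaMemcpy2D skips the padding: host rows are `width` bytes apart,
    // device rows are `pitch` bytes apart.
    std::vector<char> h((size_t)width * height);
    cudaMemcpy2D(h.data(), width, d, pitch, width, height,
                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    *pitch_out = pitch;
    return h;
}
```

For an array that is only 2 chars wide, the padded row is likely to be much larger than the 2 bytes of data, so a flat 1D allocation is probably the simpler choice here.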


kristleifur · February 19, 2008, 5:07pm

Does the array size depend on the last calculation?

I’d still take Wumpus’s advice and allocate them once in the beginning, just allocate a maximum size if you know it.

Allocation != value assignment.
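A sketch of what "allocate once, assign values every iteration" could look like, assuming an upper bound on m, n, and l is known. The struct `DeviceBuffers`, the helpers `alloc_max`, `step`, and `release`, and the kernel body are all hypothetical names and placeholders, not anything from the thread.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort on any CUDA runtime error (hypothetical helper).
static void check(cudaError_t e)
{
    if (e != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        std::exit(EXIT_FAILURE);
    }
}

// Placeholder for the real computation: thread i fills row i of c.
__global__ void f_kernel(const char* a, int m, const char* b, int n,
                         char* c, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;
    c[2 * i + 0] = a[i % m];
    c[2 * i + 1] = b[2 * (i % n)];
}

struct DeviceBuffers { char *a, *b, *c; };

// Called once before the loop: reserve room for the largest case.
DeviceBuffers alloc_max(size_t max_m, size_t max_n, size_t max_l)
{
    DeviceBuffers d = {NULL, NULL, NULL};
    check(cudaMalloc((void**)&d.a, max_m));
    check(cudaMalloc((void**)&d.b, 2 * max_n));
    check(cudaMalloc((void**)&d.c, 2 * max_l));
    return d;
}

// Called every iteration: only values move, and only the bytes in use.
// The caller must keep m, n, l within the maxima passed to alloc_max.
void step(const DeviceBuffers& d, const std::vector<char>& a,
          const std::vector<char>& b, std::vector<char>& c)
{
    int m = (int)a.size();
    int n = (int)b.size() / 2;
    int l = (int)c.size() / 2;       // c is sized 2 * l by the caller
    check(cudaMemcpy(d.a, a.data(), a.size(), cudaMemcpyHostToDevice));
    check(cudaMemcpy(d.b, b.data(), b.size(), cudaMemcpyHostToDevice));
    int threads = 256;
    f_kernel<<<(l + threads - 1) / threads, threads>>>(d.a, m, d.b, n, d.c, l);
    check(cudaGetLastError());
    check(cudaMemcpy(c.data(), d.c, c.size(), cudaMemcpyDeviceToHost));
}

// Called once after the loop.
void release(DeviceBuffers& d)
{
    check(cudaFree(d.a));
    check(cudaFree(d.b));
    check(cudaFree(d.c));
    d.a = d.b = d.c = NULL;
}
```

The loop then becomes `alloc_max(...)` once, `step(...)` per iteration with whatever sizes that iteration needs, and `release(...)` at the end.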

wumpus · February 19, 2008, 5:16pm

And if you don’t know the maximum size, you could allocate some buffer in the beginning, and reallocate if it turns out that the size is larger than the previous size (best is to choose the next higher power of two and allocate that, so you don’t keep reallocating if it grows a bit every step).
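The grow-on-demand idea can be sketched roughly like this. `GrowBuffer`, `reserve`, and `next_pow2` are hypothetical names. The sketch assumes the old contents do not need to survive a reallocation, which holds if the inputs are uploaded again every iteration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Smallest power of two that is >= n (returns 1 for n <= 1).
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// A device buffer that only ever grows.
struct GrowBuffer {
    char*  ptr = nullptr;
    size_t capacity = 0;   // bytes currently allocated
};

// Make sure buf holds at least `needed` bytes.
// Contents are discarded whenever the buffer has to grow.
cudaError_t reserve(GrowBuffer& buf, size_t needed)
{
    if (needed <= buf.capacity) return cudaSuccess;   // common case: reuse

    cudaFree(buf.ptr);            // freeing a null pointer is a no-op
    buf.ptr = nullptr;
    buf.capacity = 0;

    size_t new_cap = next_pow2(needed);
    cudaError_t err = cudaMalloc((void**)&buf.ptr, new_cap);
    if (err == cudaSuccess) buf.capacity = new_cap;
    return err;
}
```

With sizes that creep up by one element per iteration, rounding up to a power of two means a reallocation happens only about once per doubling instead of on every step.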

wildcat4096 · February 19, 2008, 5:29pm

I almost always allocate multidimensional arrays as a single dimension array and do the appropriate indexing, e.g.

int a[N][M]

can be written as

int ap[N*M]

and indexed

ap[i*M+j] where i is in [0,N) and j is in [0,M)
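As a sketch, the flat indexing described above might be used in a kernel like this. The helper `flat_index`, the kernel `fill_table`, the host wrapper `make_table`, and the value written (10*i + j) are made up purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Row-major offset of element (i, j) in an array with M columns.
__host__ __device__ inline int flat_index(int i, int j, int M)
{
    return i * M + j;
}

// One thread per element of the logical N x M array.
__global__ void fill_table(int* ap, int N, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i < N && j < M)
        ap[flat_index(i, j, M)] = 10 * i + j;
}

// Allocates N*M ints as one flat block, fills it, and copies it back.
std::vector<int> make_table(int N, int M)
{
    std::vector<int> h((size_t)N * M);
    int* d_ap = NULL;
    cudaMalloc((void**)&d_ap, h.size() * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    fill_table<<<grid, block>>>(d_ap, N, M);

    cudaMemcpy(h.data(), d_ap, h.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_ap);
    return h;
}
```

Because the array is one contiguous block, a single cudaMalloc and a single cudaMemcpy are enough, and the same `flat_index` works on both host and device.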

casybaby · February 19, 2008, 5:42pm

The array size differs a lot from one loop iteration to the next: the maximum could be more than a thousand, the minimum could be just one. And I cannot tell how often I will have two long arrays; it depends on the actual case. Do you still think it is a good idea to allocate that much space for the arrays?

wildcat4096 · February 19, 2008, 6:49pm

So, max{m,n,l} = 1000? So, if you were to collapse the second dimension you would still only have 2000 elements. You know the types used, so you can estimate the memory used. Shoot, if you have the space, double it for safety. If you have the space on the card, do what wumpus says and save the overhead of allocation/deallocation. When shuffling memory, you only need to fill a and b and copy back from c what is actually needed. Copying the entire oversized array would not be desirable.

(Sorry, misread your message. I see max{m,n,l} > 1000. Regardless, since you know the type you can estimate the memory needed. You could even preallocate up to the maximum allowed on the card, bail out or split up too large of a problem, but still only copy what needs to be to/from the card.)
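A rough sketch of the memory estimate for the three char arrays, with a check against what the card currently reports as free. `bytes_needed` and `fits_on_device` are hypothetical helper names. The free figure from cudaMemGetInfo should be treated as a hint rather than a guarantee that one large allocation will succeed.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Bytes needed on the device for a[m], b[n][2] and c[l][2] of char.
size_t bytes_needed(size_t m, size_t n, size_t l)
{
    return (m + 2 * n + 2 * l) * sizeof(char);
}

// True if the request is no larger than the memory currently free.
// Returns false if the query itself fails.
bool fits_on_device(size_t bytes)
{
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return false;
    return bytes <= free_bytes;
}
```

With m = n = l = 1500 this comes to 7500 bytes, which is tiny next to the memory on any CUDA card, so preallocating for the maximum costs very little in this case.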

casybaby · February 19, 2008, 7:53pm


Thanks. I will take all your advice and allocate the memory before the loop.

And how about the blocks? On the device, each thread will calculate one element of c. The number of threads I need depends on the length of the array. Sometimes it is small, sometimes large. It goes like 1, 2, 3, …, 1498, 1499, 1500, 1500, 1500, …, 1500, 1499, 1498, …, 3, 2, 1. Each number is the number of threads in one loop iteration.

I know the maximum is 512 threads per block. Should I fix the number of blocks ahead of time, or should I set the number of blocks in each loop iteration? I am not sure how much overhead setting it adds.
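One possible way to handle this is to recompute the block count on the host for each launch. The launch configuration is passed with every kernel call anyway, and working it out is a single integer division. In the sketch below, the block size of 256 and the names `blocks_for`, `mark`, and `count_written` are my own assumptions. The buffer is allocated inside the helper only to keep the example self-contained. In the real loop it would be preallocated.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size; it must not exceed the per-block limit of the device.
const int THREADS_PER_BLOCK = 256;

// Ceiling division: enough blocks to cover n threads. Assumes n >= 1.
inline int blocks_for(int n)
{
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

// Hypothetical kernel: thread i marks element i of c as computed.
__global__ void mark(char* c, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < l) c[i] = 1;   // surplus threads in the last block do nothing
}

// Launches with a grid sized for this iteration's l and returns how many
// elements of a `capacity`-sized buffer were written.
int count_written(int l, int capacity)
{
    char* d_c = NULL;
    cudaMalloc((void**)&d_c, capacity);
    cudaMemset(d_c, 0, capacity);

    mark<<<blocks_for(l), THREADS_PER_BLOCK>>>(d_c, l);

    std::vector<char> h(capacity);
    cudaMemcpy(h.data(), d_c, capacity, cudaMemcpyDeviceToHost);
    cudaFree(d_c);

    int count = 0;
    for (int k = 0; k < capacity; ++k) count += h[k];
    return count;
}
```

The `if (i < l)` guard is what makes a rounded-up grid safe: for l = 1500 the launch uses 6 blocks of 256 threads, and the 36 extra threads simply return.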
