Can we do malloc inside a __global__ function?

Here is what I want to do:

__global__
void function(…)
{
    float ***vav[3];

    vav[0] = (float***) malloc((meshmax+1) * sizeof(float));
    for (int i = 0; i < nmesh; i++) {
        vav[i][0] = (float**) malloc((meshmax+1) * sizeof(float));
        for (int j = 0; j < nmesh; j++)
            vav[i][j][0] = (float*) malloc((meshmax+1) * sizeof(float));
    }
}

I have a bunch of arrays like that which only need to be calculated inside the kernel.
Is this legit?
It compiles fine, but as soon as I try to assign a value to vav[i][j][k], I get a seg fault.

Hi Jam1

I think you can’t, since calling library functions inside a kernel is not allowed.

I am afraid the only way is to use cudaMalloc (or functions from the same family) before calling your kernel.

Matt

This is strange. It seems that the malloc call does not generate the error, only the assignment.
Why would that be? And why is there no error message during compilation?
If I do cudaMalloc outside the kernel, do I need to pass the array as an argument?

Thanks for any clues.

That shouldn’t be the case. nvcc should emit an “error: calling a host function from a device/global function is only allowed in device emulation mode” error and stop.

No, you can also write the address of the allocation onto a constant memory symbol. But the kernel has to be made aware of it one way or the other.
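
Something along these lines (a minimal sketch, names are my own; a __device__ pointer is used here, and a __constant__ one works the same way with cudaMemcpyToSymbol):

[codebox]
// Sketch: allocate with cudaMalloc on the host, then write the resulting address
// onto a __device__ pointer symbol so the kernel can use it without an argument.
#include <stdio.h>
#include <cuda_runtime.h>

__device__ float *d_buf;                  // device-side pointer the kernel reads/writes through

__global__ void fill(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_buf[i] = 2.0f * i;              // no pointer argument needed
}

int main()
{
    const int n = 32;
    float *p = 0;
    cudaMalloc((void **)&p, n * sizeof(float));        // allocation happens before the launch
    cudaMemcpyToSymbol(d_buf, &p, sizeof(float *));    // hand the address to the device symbol
    fill<<<1, n>>>(n);

    float h[n];
    cudaMemcpy(h, p, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.1f ... %.1f\n", h[0], h[n - 1]);
    cudaFree(p);
    return 0;
}
[/codebox]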

Oops, I was compiling with emulation ON.

But then, should it work at all in emulation mode?

You are right, it does give an error without emulation and does not compile.

Maybe I am spending too much time in emulation/debug mode to get this program going and I forgot to switch :(

How would you do that?

Thanks

Emulation is purely host code, 100% compiled by the host compiler. It would be more surprising if it didn’t work.

cudaMemcpyToSymbol

thank you, I will try that.

The last line of the first post said that it gives an error in emulation. Maybe it is not related to this problem.

There are two issues - not compiling and not working. The not-working part is almost certainly an error in your code. The not-compiling part is down to what features are and are not supported by the different compilation trajectories nvcc uses, depending on whether you are compiling for emulation or for the device.

I am trying to make this cudaMemcpyToSymbol work, but no luck so far.

I looked at the examples in the SDK, but none fit my task exactly. For example, how do I declare the symbol?
I tried the following outside the functions:
float ***vav[3]; // it says invalid device symbol

and then, in the call:

cutilSafeCall( cudaMemcpyToSymbol(vav, h_vav, 121212*sizeof(float) ));

Then I tried to put __constant__, __device__, __shared__ and __global__ in front of float; nothing works.
Thanks for your help
Thanks for your help

You can see a working example here.

OK, I can see that the __device__ float is declared outside the function.

This example is not what I need to do. In the example, the values are passed with cudaMemcpy. In my case, some arrays in the kernel are temporary only and use the values from other arrays which are passed with cudaMemcpy. The problem lies with the declaration of these temporary arrays.

There is no problem with scalar values being temporary and local to the kernel, but it seems that is not so for arrays, where some a priori allocation is needed.

I will play with this example to see if I can advance my case.

You asked how to use cudaMemcpyToSymbol to write the address of dynamically allocated memory onto a device side pointer. The code I linked to does exactly that. Sorry for not being more helpful…

I compiled and ran your example:

1. It gives a warning:

Warning: Cannot tell what pointer points to, assuming global memory space

2. It gives an error in release mode:

FAILURE 38 in forums_example_symbol.cu, line 31

I had to remove all the gpuAssert calls.

3. It gives a different answer than emulation.

I will modify it to get what I want.

That is normal. It compiles and runs otherwise (Ubuntu 9.04 x86_64, CUDA 2.3):

[codebox]
avid@cuda:~/code/cuda_struct$ nvcc -Xopencc="-Wall" example2.cu -o example2
In file included from example2.cu:13:
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_clock':
/opt/cuda-2.3/bin/../include/common_functions.h:72: warning: implicit declaration of function `clock'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memset':
/opt/cuda-2.3/bin/../include/common_functions.h:77: warning: implicit declaration of function `memset'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memcpy':
/opt/cuda-2.3/bin/../include/common_functions.h:82: warning: implicit declaration of function `memcpy'
example2.cu: At top level:
/opt/cuda-2.3/bin/../include/common_functions.h:71: warning: `__cuda_clock' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:76: warning: `__cuda_memset' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:81: warning: `__cuda_memcpy' defined but not used
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
avid@cuda:~/code/cuda_struct$ ./example2
 0   15.0
 1   16.0
 2   17.0
 3   18.0
 4   19.0
 5   20.0
 6   21.0
 7   22.0
 8   23.0
 9   24.0
10   25.0
11   26.0
12   27.0
13   28.0
14   29.0
15   30.0
16   31.0
17   32.0
18   33.0
19   34.0
20   35.0
21   36.0
22   37.0
23   38.0
24   39.0
25   40.0
26   41.0
27   42.0
28   43.0
29   44.0
30   45.0
31   46.0
[/codebox]

That is the NoDevice error code. Something in your CUDA installation is broken.

Got it to work finally. When I changed gpuAssert to cutilSafeCall, I saw that no CUDA device was available. I restarted X, and everything compiled and ran as expected in both emulation and release. I also modified the program to do what I want, and it works as well. I can assign values to the array ad[idx][0]. Back to regular programming, I guess.
Thank you very much for your help.

This is something similar to what I want to do:

[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif

#define _DSIZE 4

__device__ float ***ad;

__global__ void testkernel2(float *d)
{
    unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1;   // note: idx is 1-based here

    ad[idx][idx][idx] = 23.0*idx;        // this is what I need to do
    d[idx] = ad[idx][idx][idx];          // in my kernel
}

int main()
{
    int device = 0;
    cudaSetDevice(device);

    float *d, *_d;
    float ***_a;                         // 3D arrays - can I allocate with malloc .....

    assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );

    gpuAssert( cudaMalloc( (void**)&_a, _DSIZE * sizeof(float) ) );   // Does this need cudaMalloc3D or cudaMallocArray?
    gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
    gpuAssert( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
    gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float) ) );        // … and use this to transfer the allocation address?

    testkernel2 <<< 1, _DSIZE >>> (_d);

    cudaThreadSynchronize();
    cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < _DSIZE; i++) {
        fprintf(stdout, "%2d %6.1f\n", i, d[i]);
    }

    cudaFree(_a);
    cudaFree(_d);
    free(d);

    return cudaThreadExit();
}
[/codebox]

Is it possible? I have tried many variations so far, but no success except with *ad[3].

I think you need to hit the books on C pointers and memory management - your problem is well out of the realm of CUDA itself.

If you really want a dynamically allocated 3D array (there are lots of reasons why it is neither necessary, nor a good idea), then you are going to have to recursively allocate the memory for it using cudaMalloc, which will be a nightmare. Save yourself a lot of trouble and use 3D indexing into a 1D array.
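
To be concrete (a quick sketch with made-up names, not the linked example, and only the kernel-side piece): 3D indexing into a 1D array just means one flat allocation plus an index macro, something like:

[codebox]
// One flat buffer of nx*ny*nz floats (allocated with cudaMalloc and written to the
// symbol as before) plus a macro that turns (i,j,k) into a single offset.
#define IDX3(i, j, k, nx, ny) (((k) * (ny) + (j)) * (nx) + (i))   // 0-based, i varies fastest

__device__ float *a;   // points at nx*ny*nz floats in one linear allocation

__global__ void set_element(int i, int j, int k, int nx, int ny, float v)
{
    a[IDX3(i, j, k, nx, ny)] = v;   // the element a "real" float[nz][ny][nx] array would call a[k][j][i]
}
[/codebox]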

Thank you for showing me one way:

[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

// 3D to linear array conversion
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+(i_index)-(i_lb))

#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif

#define _DSIZE 4

__device__ float *ad;

__global__ void testkernel2(float *d)
{
    unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1;

    ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] = 23.0*idx;
    d[idx] = ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)];
}

int main()
{
    int device = 0;
    cudaSetDevice(device);

    float *d, *_d;
    float *_a;

    assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );

    gpuAssert( cudaMalloc( (void**)&_a, _DSIZE*_DSIZE*_DSIZE* sizeof(float) ) );
    gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
    gpuAssert( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
    gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *) ) );   // copy the device pointer itself (pointer size, not float size)

    testkernel2 <<< 1, _DSIZE >>> (_d);

    cudaThreadSynchronize();
    cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < _DSIZE; i++) {
        fprintf(stdout, "%2d %6.1f\n", i, d[i]);
    }

    cudaFree(_a);
    cudaFree(_d);
    free(d);

    return cudaThreadExit();
}
[/codebox]

This is working finally.

But I just discovered that I cannot set _DSIZE to more than 31 in the previous code, i.e. the maximum 3D array is 31x31x31 (29,791 elements). Why?
It is not a function of the FTNREF3D macro; I tried without it.
I read in the reference manual that I can allocate with cudaMalloc3DArray an array up to (8192,0,0) for a 1D array and (2048,2048,2048) for a 3D array. But in my case I can only access up to index 29,791 for a 1D array (vector).
What I would like is for _DSIZE to go up to 128.

You can’t use cudaMalloc3DArray the way you are trying to. cudaMalloc3DArray is for allocating memory for storing 3D texture data. All cudaArrays are opaque memory types that are not intended to be manipulated by user code other than the texture access calls and the alloc/copy APIs.

As I said in my previous reply, if you want dynamically allocated 3D arrays, you will have to recursively call cudaMalloc - for an nxnxn 3D array, that means a total of 1 + n + n**2 malloc calls, and all sorts of convoluted, recursive copy to symbol calls or initialization kernels. And when you are done, you will have made everything slower because there are now two extra levels of pointer indirection required to get to your data. Just use a 1D linear memory space allocated with cudaMalloc and index into it using a 3D index.