Can we do malloc inside a __global__ function?

Here is what I want to do:

__global__
void function(…)
{
    float ***vav[3];

    vav[0] = (float***) malloc((meshmax+1) * sizeof(float));
    for (int i = 0; i < nmesh; i++) {
        vav[i][0] = (float**) malloc((meshmax+1) * sizeof(float));
        for (int j = 0; j < nmesh; j++)
            vav[i][j][0] = (float*) malloc((meshmax+1) * sizeof(float));
    }
}

I have a bunch of arrays like that which only need to be calculated inside the kernel.
Is this legit?
It compiles fine, but as soon as I try to assign a value to vav[i][j][k], I get a seg fault.

Hi Jam1

I think you can’t, since calling library functions inside a kernel is not allowed.

I am afraid the only way is to use cudaMalloc (or functions from the same family) before calling your kernel.

Matt

This is strange. It seems that the malloc call does not generate the error, only the assignment.
Why would that be? And why is there no error message during compilation?
If I do cudaMalloc outside the kernel, do I need to pass the array as an argument?

Thanks for any clues.

That shouldn’t be the case. nvcc should emit an “error: calling a host function from a device/global function is only allowed in device emulation mode” error and stop.

No, you can also write the address of the allocation onto a constant memory symbol. But the kernel has to be made aware of it one way or the other.
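
Something along these lines (a minimal sketch, names are my own; a __device__ pointer is used here, and a __constant__ one works the same way with cudaMemcpyToSymbol):

[codebox]
// Sketch: allocate with cudaMalloc on the host, then write the resulting address
// onto a __device__ pointer symbol so the kernel can use it without an argument.
#include <stdio.h>
#include <cuda_runtime.h>

__device__ float *d_buf;                  // device-side pointer the kernel reads/writes through

__global__ void fill(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_buf[i] = 2.0f * i;              // no pointer argument needed
}

int main()
{
    const int n = 32;
    float *p = 0;
    cudaMalloc((void **)&p, n * sizeof(float));        // allocation happens before the launch
    cudaMemcpyToSymbol(d_buf, &p, sizeof(float *));    // hand the address to the device symbol
    fill<<<1, n>>>(n);

    float h[n];
    cudaMemcpy(h, p, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.1f ... %.1f\n", h[0], h[n - 1]);
    cudaFree(p);
    return 0;
}
[/codebox]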

Oops, I was compiling with emulation ON.

But then, should it work at all in emulation mode?

You are right, it does give an error without emulation and does not compile.

Maybe I am spending too much time in emulation/debug mode to get this program going and I forgot to switch :(

How would you do that?

Thanks

Emulation is purely host code, 100% compiled by the host compiler. It would be more surprising if it didn’t work.

cudaMemcpyToSymbol

thank you, I will try that.

The last line of the first post said that it gives an error in emulation. Maybe it is not related to this problem.

There are two issues - not compiling and not working. The not-working part is almost certainly an error in your code. The not-compiling part is down to what features are and are not supported by the different compilation trajectories nvcc uses, depending on whether you are compiling for emulation or for the device.

I am trying to make this cudaMemcpyToSymbol work, but no luck so far.

I looked at the examples in the SDK, but none fit my task exactly. For example, how do I declare the symbol?
I tried the following outside the functions:
float ***vav[3]; // it says invalid device symbol

and then, in the call:

cutilSafeCall( cudaMemcpyToSymbol(vav, h_vav, 121212*sizeof(float) ));

Then I tried to put __constant__, __device__, __shared__ and __global__ in front of float; nothing works.
Thanks for your help
Thanks for your help

You can see a working example here.

OK, I can see that the __device__ float is declared outside the function.

This example is not what I need to do. In the example, the values are passed with cudaMemcpy. In my case, some arrays in the kernel are temporary only and use the values from other arrays which are passed with cudaMemcpy. The problem lies with the declaration of these temporary arrays.

There is no problem with scalar values being temporary and local to the kernel, but it seems that is not so for arrays, where some a priori allocation is needed.

I will play with this example to see if I can advance my case.

You asked how to use cudaMemcpyToSymbol to write the address of dynamically allocated memory onto a device side pointer. The code I linked to does exactly that. Sorry for not being more helpful…

I compiled and ran your example:

1. It gives a warning:

Warning: Cannot tell what pointer points to, assuming global memory space

2. It gives an error in release mode:

FAILURE 38 in forums_example_symbol.cu, line 31

I had to remove all the gpuAssert calls.

3. It gives a different answer than emulation.

I will modify it to get what I want.

That is normal. It compiles and runs otherwise (Ubuntu 9.04 x86_64, CUDA 2.3):

[codebox]
avid@cuda:~/code/cuda_struct$ nvcc -Xopencc="-Wall" example2.cu -o example2
In file included from example2.cu:13:
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_clock':
/opt/cuda-2.3/bin/../include/common_functions.h:72: warning: implicit declaration of function `clock'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memset':
/opt/cuda-2.3/bin/../include/common_functions.h:77: warning: implicit declaration of function `memset'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memcpy':
/opt/cuda-2.3/bin/../include/common_functions.h:82: warning: implicit declaration of function `memcpy'
example2.cu: At top level:
/opt/cuda-2.3/bin/../include/common_functions.h:71: warning: `__cuda_clock' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:76: warning: `__cuda_memset' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:81: warning: `__cuda_memcpy' defined but not used
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
avid@cuda:~/code/cuda_struct$ ./example2
 0   15.0
 1   16.0
 2   17.0
 3   18.0
 4   19.0
 5   20.0
 6   21.0
 7   22.0
 8   23.0
 9   24.0
10   25.0
11   26.0
12   27.0
13   28.0
14   29.0
15   30.0
16   31.0
17   32.0
18   33.0
19   34.0
20   35.0
21   36.0
22   37.0
23   38.0
24   39.0
25   40.0
26   41.0
27   42.0
28   43.0
29   44.0
30   45.0
31   46.0
[/codebox]

That is the NoDevice error code. Something in your CUDA installation is broken.

Got it to work finally. When I changed gpuAssert to cutilSafeCall, I saw that no CUDA device was available. I restarted X, and everything compiled and ran as expected in both emulation and release. I also modified the program to do what I want, and it works as well. I can assign values to the array ad[idx][0]. Back to regular programming, I guess.
Thank you very much for your help.

This is something similar to what I want to do:

[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif

#define _DSIZE 4

__device__ float ***ad;

__global__ void testkernel2(float *d)
{
    unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1;   // note: idx is 1-based here

    ad[idx][idx][idx] = 23.0*idx;        // this is what I need to do
    d[idx] = ad[idx][idx][idx];          // in my kernel
}

int main()
{
    int device = 0;
    cudaSetDevice(device);

    float *d, *_d;
    float ***_a;                         // 3D arrays - can I allocate with malloc .....

    assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );

    gpuAssert( cudaMalloc( (void**)&_a, _DSIZE * sizeof(float) ) );   // Does this need cudaMalloc3D or cudaMallocArray?
    gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
    gpuAssert( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
    gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float) ) );        // … and use this to transfer the allocation address?

    testkernel2 <<< 1, _DSIZE >>> (_d);

    cudaThreadSynchronize();
    cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < _DSIZE; i++) {
        fprintf(stdout, "%2d %6.1f\n", i, d[i]);
    }

    cudaFree(_a);
    cudaFree(_d);
    free(d);

    return cudaThreadExit();
}
[/codebox]

Is it possible? I have tried many variations so far, but no success except with *ad[3].

I think you need to hit the books on C pointers and memory management - your problem is well out of the realm of CUDA itself.

If you really want a dynamically allocated 3D array (there are lots of reasons why it is neither necessary, nor a good idea), then you are going to have to recursively allocate the memory for it using cudaMalloc, which will be a nightmare. Save yourself a lot of trouble and use 3D indexing into a 1D array.
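
To be concrete (a quick sketch with made-up names, not the linked example, and only the kernel-side piece): 3D indexing into a 1D array just means one flat allocation plus an index macro, something like:

[codebox]
// One flat buffer of nx*ny*nz floats (allocated with cudaMalloc and written to the
// symbol as before) plus a macro that turns (i,j,k) into a single offset.
#define IDX3(i, j, k, nx, ny) (((k) * (ny) + (j)) * (nx) + (i))   // 0-based, i varies fastest

__device__ float *a;   // points at nx*ny*nz floats in one linear allocation

__global__ void set_element(int i, int j, int k, int nx, int ny, float v)
{
    a[IDX3(i, j, k, nx, ny)] = v;   // the element a "real" float[nz][ny][nx] array would call a[k][j][i]
}
[/codebox]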

Thank you for showing me one way:

[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

// 3D to linear array conversion
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+(i_index)-(i_lb))

#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif

#define _DSIZE 4

__device__ float *ad;

__global__ void testkernel2(float *d)
{
    unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1;

    ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] = 23.0*idx;
    d[idx] = ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)];
}

int main()
{
    int device = 0;
    cudaSetDevice(device);

    float *d, *_d;
    float *_a;

    assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );

    gpuAssert( cudaMalloc( (void**)&_a, _DSIZE*_DSIZE*_DSIZE* sizeof(float) ) );
    gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
    gpuAssert( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
    gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *) ) );   // copy the device pointer itself (pointer size, not float size)

    testkernel2 <<< 1, _DSIZE >>> (_d);

    cudaThreadSynchronize();
    cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < _DSIZE; i++) {
        fprintf(stdout, "%2d %6.1f\n", i, d[i]);
    }

    cudaFree(_a);
    cudaFree(_d);
    free(d);

    return cudaThreadExit();
}
[/codebox]

This is working finally.

But I just discovered that I cannot set _DSIZE to more than 31 in the previous code, i.e. the maximum 3D array is 31x31x31 (29,791 elements). Why?
It is not a function of the FTNREF3D macro; I tried without it.
I read in the reference manual that I can allocate with cudaMalloc3DArray an array up to (8192,0,0) for a 1D array and (2048,2048,2048) for a 3D array. But in my case I can only access up to index 29,791 for a 1D array (vector).
What I would like is for _DSIZE to go up to 128.

You can’t use cudaMalloc3DArray the way you are trying to. cudaMalloc3DArray is for allocating memory for storing 3D texture data. All cudaArrays are opaque memory types that are not intended to be manipulated by user code other than the texture access calls and the alloc/copy APIs.

As I said in my previous reply, if you want dynamically allocated 3D arrays, you will have to recursively call cudaMalloc - for an nxnxn 3D array, that means a total of 1 + n + n**2 malloc calls, and all sorts of convoluted, recursive copy to symbol calls or initialization kernels. And when you are done, you will have made everything slower because there are now two extra levels of pointer indirection required to get to your data. Just use a 1D linear memory space allocated with cudaMalloc and index into it using a 3D index.