Allocate a non-constant number of arrays on the device

Hi everyone,

I’m a new user of CUDA, so sorry if my question looks stupid. I read several topics in this forum but I didn’t find a solution :-(

I’m converting C code to CUDA code. The “kernel” algorithm uses dynamic programming (each new value depends on previous computations). Each thread runs this algorithm and builds an N*M matrix. Both dimensions depend on the input data: N is the same for all threads, but M differs per thread.

So I want to allocate T arrays of (N*M) floats on the device. In the following code, N is “input_size”, M is given by the “reference_size” array, and T corresponds to “number_reference”. Allocating the arrays themselves is no problem (assuming I haven’t made a mistake):

[codebox]
float* compute_score(float2** references_vec, float2* input_vec, int number_reference, int* reference_size, int input_size)
[…]
    for(int i = 0; i < number_reference; i++) {
        float* devPtr;
        size_t pitch; // cudaMallocPitch expects a size_t*, not an int*
        cudaMallocPitch((void**)&devPtr, &pitch, reference_size[i]*sizeof(float), input_size);
    }[/codebox]

or something like that.
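One detail worth noting with cudaMallocPitch: each row may be padded, so elements have to be addressed through the returned pitch (in bytes), not through the logical row width. A minimal sketch of the standard indexing idiom, reusing your variable names:

```cuda
// Address element (r, c) of a pitched allocation inside a kernel.
// `pitch` is a byte count, so the row arithmetic goes through a char*.
__global__ void read_pitched(float* devPtr, size_t pitch, int n_rows, int n_cols)
{
    for (int r = 0; r < n_rows; r++) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < n_cols; c++) {
            float value = row[c]; // element (r, c)
            // ... use value ...
        }
    }
}
```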

My problem is how to store the “devPtr” pointers in an array so that each thread can find its own array. (I guess a float** array declared in host code cannot be dereferenced on the device, since it lives in host memory.)

I read that cudaGetSymbolAddress works only for __device__ variables.

One solution is to use a 3D array, but that means a huge waste of memory, since every slice would have to be padded to the largest M.

I tried many things but none works.

Thank you for your help !

Hi,

first you have to allocate memory on the device for the pointers to the arrays. It could look like this (not tested):

float **devPtr;
cudaMalloc((void**)&devPtr, number_of_arrays*sizeof(float*));

float **devPtrDummy;
devPtrDummy = (float**)malloc(number_of_arrays*sizeof(float*));

Now there’s an array of float* on the device and on the host.

The second step is to loop over all the pointers and to allocate the desired number of array elements for each array:

for(int i = 0; i < number_of_arrays; i++)

  cudaMalloc((void**)(devPtrDummy+i), length_of_array[i]*sizeof(float));

Now you have allocated enough memory for the arrays on the device. But these arrays are not yet reachable from the device, so copy the pointer table over:

cudaMemcpy((void*)devPtr, (void*)devPtrDummy, number_of_arrays*sizeof(float*), cudaMemcpyHostToDevice);

Now devPtr points to an array (on the device) of pointers to arrays (on the device).
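Putting the three steps together, a complete sketch might look like this (untested; number_of_arrays and length_of_array are your variables, error checking omitted):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Build a device array of device pointers, one per (variable-length) array.
float** alloc_jagged(int number_of_arrays, const int* length_of_array)
{
    // Step 1: space on the device for the pointer table itself.
    float** devPtr;
    cudaMalloc((void**)&devPtr, number_of_arrays * sizeof(float*));

    // Host-side staging copy of the pointer table.
    float** devPtrDummy = (float**)malloc(number_of_arrays * sizeof(float*));

    // Step 2: one device allocation per array; the pointers land on the host.
    for (int i = 0; i < number_of_arrays; i++)
        cudaMalloc((void**)(devPtrDummy + i), length_of_array[i] * sizeof(float));

    // Step 3: publish the pointer table to the device.
    cudaMemcpy(devPtr, devPtrDummy, number_of_arrays * sizeof(float*),
               cudaMemcpyHostToDevice);

    free(devPtrDummy);
    return devPtr; // pass this to a kernel; devPtr[i] is only valid on the device
}
```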

Hope I answered your question :unsure:

Thank you, it looks like it works (I still have to try it in my complete code).

I have one more question:

I tried the following code :

[codebox]__device__ float** devPtr;

__global__ void myFunc() {
    float* tmp = devPtr[num_thread]; // num_thread is computed correctly
    ...
}

void init() {
    // int size computed correctly
    cudaMalloc((void**)&devPtr, size);
    float** tmp = (float**) malloc(size);
    for(int i = 0; i < N; i++) {
            ...
    }
    cudaMemcpy(devPtr, tmp, ...);
    ...
    myFunc<<<dimGrid,dimBlock>>>();
    ...
}[/codebox]

My problem is that in the kernel function, devPtr[num_thread] causes a segfault (I work in emulation mode). Apparently devPtr is (nil) inside the kernel, and I don’t understand why: cudaMalloc returns no error and the pointer exists.

If I pass devPtr as a parameter of the kernel function, everything works. So apparently devPtr isn’t initialized in global scope. Maybe I made a mistake, but I thought that __device__ float** devPtr was global and persistent.
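For what it’s worth, the working variant (passing the pointer table as a kernel argument) could be sketched like this; it sidesteps the __device__ symbol entirely. A hedged, untested sketch with assumed names:

```cuda
#include <cuda_runtime.h>

// Variant that works: hand the device pointer table to the kernel directly.
__global__ void myFunc(float** ptrs)
{
    int num_thread = blockIdx.x * blockDim.x + threadIdx.x;
    float* tmp = ptrs[num_thread]; // each thread picks up its own array
    // ...
}

void init(float** devPtr, dim3 dimGrid, dim3 dimBlock)
{
    // devPtr was filled in on the host side as before; just pass it along.
    myFunc<<<dimGrid, dimBlock>>>(devPtr);
}
```

(If the __device__ symbol is really wanted, the usual route is to set it from the host with cudaMemcpyToSymbol rather than taking its address with & in host code.)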

Thank you