__device__ function array help

Hi,

I think I have to study more. I have trouble understanding local memory and how device functions work with arrays.
I want to create a two-dimensional array tmp[aNum][bNum] in my device function.

This is part of my program:

//kernel
__global__ void myKernel(int aNum, int bNum, int cNum)
{
    int thx = blockIdx.x * blockDim.x + threadIdx.x;
    if (thx < cNum) {
        myClass(aNum, bNum, cNum);
        __syncthreads();
    }
}

//device function
__device__ int myClass(int aNum, int bNum, int csNum)
{
    //local 2d array: its size is only known at run time,
    //and this is the line the compiler rejects
    float tmp[aNum][bNum];
}

I got this compiler error:
"incomplete type is not allowed"

If I make aNum and bNum constants it works, but this is not
what I want to do.

Any help will be appreciated.

Thank you

Okay, there are a couple of things you should be aware of. As far as I know, you cannot dynamically allocate memory within a kernel. In this case, it looks like you don't know the required size of your array at compile time, but you do know it before launching the kernel. So before you launch the kernel, dynamically allocate the memory in GPU (or host) memory from your host code, and pass a pointer to that memory as a parameter when you call your kernel. Then each thread can work off that memory. This approach may be slow if you're using an older GeForce 8-series, 9-series, or GTX 2xx GPU, but if you're using a newer GTX 4xx (Fermi) GPU, performance should be good thanks to the L1 cache.
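
A minimal sketch of that suggestion, not the original poster's final code: the host allocates one scratch buffer with cudaMalloc, and each thread uses its own aNum x bNum slice of it in place of tmp[aNum][bNum]. The names fillAndSum, runScratchDemo, scratch and out are made up for illustration, the values written into tmp are placeholders, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Device function: uses this thread's slice of the preallocated buffer
// instead of a run-time-sized local array. tmp[a * bNum + b] plays the
// role of tmp[a][b].
__device__ float fillAndSum(float *tmp, int aNum, int bNum)
{
    float sum = 0.0f;
    for (int a = 0; a < aNum; ++a) {
        for (int b = 0; b < bNum; ++b) {
            tmp[a * bNum + b] = (float)(a + b);  // placeholder work
            sum += tmp[a * bNum + b];
        }
    }
    return sum;
}

__global__ void myKernel(float *scratch, float *out,
                         int aNum, int bNum, int cNum)
{
    int thx = blockIdx.x * blockDim.x + threadIdx.x;
    if (thx < cNum) {
        // Each thread gets a private block of aNum * bNum floats.
        float *tmp = scratch + (size_t)thx * aNum * bNum;
        out[thx] = fillAndSum(tmp, aNum, bNum);
    }
}

// Hypothetical host helper: allocates the scratch space before the launch,
// runs the kernel, and returns one result per thread. Assumes cNum > 0.
std::vector<float> runScratchDemo(int aNum, int bNum, int cNum)
{
    float *dScratch = NULL;
    float *dOut = NULL;
    size_t scratchBytes = (size_t)cNum * aNum * bNum * sizeof(float);
    cudaMalloc((void **)&dScratch, scratchBytes);
    cudaMalloc((void **)&dOut, cNum * sizeof(float));

    int threads = 128;
    int blocks = (cNum + threads - 1) / threads;
    myKernel<<<blocks, threads>>>(dScratch, dOut, aNum, bNum, cNum);

    std::vector<float> out(cNum);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&out[0], dOut, cNum * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dScratch);
    cudaFree(dOut);
    return out;
}
```

Calling runScratchDemo(aNum, bNum, cNum) from host code sizes the buffer from the run-time values, so nothing has to be a compile-time constant. With this layout, neighbouring threads touch addresses that are aNum * bNum floats apart; on older GPUs without an L1 cache it may be worth trying an interleaved index such as tmp[(a * bNum + b) * cNum + thx] so that neighbouring threads read neighbouring addresses.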

Hi ColinS,

Thank you for your reply.
I'm using a Quadro FX 4800 card.

That is the approach I finally adopted to make my program work.
I was wondering whether there is a better way to accomplish this, since
the allocated memory will be in global memory.
Once again thank you.
