efficient static arrays in kernel

Howdy,

BACKGROUND:
I’m doing some fairly complex computation in my kernel calls, and I need to give each thread its own space in memory.

I’ve been doing this by allocating a large array and precomputing a location in that array for each separate execution to use. But when I’m doing 10,000 computations, giving each of them its own computation space is rather wasteful, since fewer than 100 will execute at a time.

To make the problem even more annoying, some of the threads need more memory space than others, though the vast majority will need a set amount that I can pass in when the host calls the kernel function.

PROBLEM:
I’m trying to use shared memory arrays allocated at the start of a device function. Each computation will call this function exactly once from the global function.

In a host function, I would write:

#define DEFINEDSIZE

struct myStruct{
    int x;
    int y;
    unsigned int iter;
    myStruct* pointerToStruct;
};

void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    //arrays for local computation
    myStruct* array1[DEFINEDSIZE];
    myStruct array2[inputSize];

    //computations

    //output
    out[index].info = computedNumber;
}

I’ve tried several different ways of doing this on the device functions, and they work in EmuDebug, but when I run it in debug the program will crash when I try to copy the output array (which was declared using cudaMalloc and passed in to the kernel) back into the host’s memory. I’m fairly sure the memory is getting corrupted while it’s on the device, but I can’t figure out how.

__device__ void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    //arrays for local computation
    myStruct* array1[DEFINEDSIZE];
    myStruct array2[200];

    //computations

    //output
    out[index].info = array2[0].iter;
}

// I tried the below based off the documentation, but it doesn’t seem to work any better
extern __shared__ SearchElement d_states1;
__device__ void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    __shared__ myStruct* array1[DEFINEDSIZE];
    SearchElement* d_states0 = (SearchElement*)d_states1;
    SearchElement* d_states= (SearchElement*)&d_states0[200];

    //blah

Both of those print the same error,
“Cuda error in file ‘main.cu’ in line 239 : unspecified launch failure.”, and then the program immediately quits.
Line 239 is on the host:
CUDA_SAFE_CALL( cudaMemcpy( h_ouput, env.d_output, sizeof(myOutputStruct)*numberOfCalls, cudaMemcpyDeviceToHost));

where env.d_output is the out argument of func.

Any help you guys can give will be appreciated. I’ve been stuck on this for a couple of weeks.

Thanks
Mark Henderson

Your use of local memory looks fine, and nothing stands out to me in your use of shared memory either.

One thing that did catch my attention is that you’re using a struct that has pointers to its own type, which suggests some sort of linked list or similar structure. Linked structures of this nature are generally difficult to transfer back and forth between host and device because host pointers are invalid on the device and vice versa. This is also a common source of errors that vanish in emulation mode but persist in non-emulation mode: in emulation the kernel runs on the CPU, where host pointers are still valid.

Without more detail I can’t see anything actually wrong, but that’s my guess.

One option is instead of pointers, use indexes relative to a big pool of objects. Then the entire pool can be transferred from host to device and back, and all the links will remain intact.
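The index idea can be sketched on the host alone. This is only a minimal illustration under some assumptions: `PoolNode`, `sumChain`, and `sumAfterByteCopy` are made-up names, and a plain `std::memcpy` stands in for `cudaMemcpy`, since both just copy raw bytes. The point it shows is that index links still resolve in the copy even after the original pool is gone.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical variant of myStruct: the self-pointer becomes an index into a pool.
struct PoolNode {
    int x;
    int y;
    unsigned int iter;
    int next;  // index of the next node in the pool, or -1 for "no link"
};

// Follow the index links from `head` and add up the iter fields.
unsigned int sumChain(const std::vector<PoolNode>& pool, int head) {
    unsigned int total = 0;
    for (int i = head; i != -1; i = pool[i].next) {
        // Every link must stay inside the pool.
        assert(i >= 0 && i < static_cast<int>(pool.size()));
        total += pool[i].iter;
    }
    return total;
}

// Copy the pool byte-for-byte into a second buffer (memcpy standing in for
// cudaMemcpy), discard the original, then walk the links in the copy only.
unsigned int sumAfterByteCopy() {
    std::vector<PoolNode> pool = {
        {0, 0, 10, 2},   // node 0 links to node 2
        {0, 0, 30, -1},  // node 1 is the tail
        {0, 0, 20, 1},   // node 2 links to node 1
    };
    std::vector<PoolNode> copy(pool.size());
    std::memcpy(copy.data(), pool.data(), pool.size() * sizeof(PoolNode));
    pool.clear();  // the copy does not depend on the original's addresses
    return sumChain(copy, 0);
}
```

With a `myStruct*` link instead of `int next`, the copied nodes would still hold addresses from the original buffer, which is the failure mode described above.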

I tried something similar and got an “Assertion Failed!” error in dbgheap.c, Expression: _CrtIsValidHeapPointer(pUserData).
I assume it’s because the pointers in the struct’s linked list get messed up when the elements are transferred from the device to the host.
Is there an easy way to dump the struct elements to an array, or should I just build that into the struct from the start?
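For what it’s worth, a dump like that can be done in one pass over the list. The sketch below is only an illustration, not a tested CUDA solution: `PtrNode` mirrors `myStruct` from the first post, while `FlatNode`, `flatten`, and `flattenDemo` are hypothetical names, and it assumes a singly linked list with no cycles.

```cpp
#include <cassert>
#include <vector>

// Pointer-linked node, shaped like myStruct in the first post.
struct PtrNode {
    int x;
    int y;
    unsigned int iter;
    PtrNode* next;
};

// Hypothetical flat counterpart: same fields, but the link is an array index.
struct FlatNode {
    int x;
    int y;
    unsigned int iter;
    int next;  // -1 marks the end of the list
};

// Walk the list once and append each node to a contiguous array.
// Nodes are stored in traversal order, so node i links to node i + 1.
// Assumes the list is acyclic; a cycle would make this loop forever.
std::vector<FlatNode> flatten(const PtrNode* head) {
    std::vector<FlatNode> out;
    for (const PtrNode* p = head; p != nullptr; p = p->next) {
        int index = static_cast<int>(out.size());
        out.push_back({p->x, p->y, p->iter, p->next ? index + 1 : -1});
    }
    return out;
}

// Usage example: build a three-node list on the stack and flatten it.
std::vector<FlatNode> flattenDemo() {
    PtrNode c = {3, 30, 2, nullptr};
    PtrNode b = {2, 20, 1, &c};
    PtrNode a = {1, 10, 0, &b};
    std::vector<FlatNode> flat = flatten(&a);
    assert(flat.size() == 3);  // one flat entry per list node
    return flat;
}
```

The resulting vector holds no addresses, so its `data()` buffer can be copied to or from the device as raw bytes. Building the struct around an index from the start avoids the conversion step entirely.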