Howdy,
BACKGROUND:
I’m doing some fairly complex computation in my kernel calls, and I need to give each thread its own space in memory.
I’ve been doing this by allocating a large array and precomputing a location in that array for each separate execution to use, but when I’m doing 10,000 computations, giving each of them its own computation space when fewer than 100 will execute at a time is rather wasteful.
To make the problem even more annoying, some of the threads need more memory space than others, though the vast majority will need a fixed amount that I can pass in when the host calls the kernel function.
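To make that layout concrete, here is a minimal sketch of the idea, not my actual code. It sizes the scratch pool by the number of threads in flight rather than by the number of computations, and each thread reuses its slot for every computation it handles. The names Scratch, SCRATCH_ELEMS, poolKernel and runPool are made up for illustration, and it assumes every computation fits in the fixed-size slot.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical per-thread scratch record; stands in for myStruct.
struct Scratch { int x; int y; unsigned int iter; };

// Assumed fixed scratch size per thread, in elements (illustration only).
#define SCRATCH_ELEMS 200

// Each launched thread owns one slot of the pool and reuses it for every
// work item it processes (grid-stride loop over the work items).
__global__ void poolKernel(Scratch* pool, unsigned int* out, unsigned int numWork)
{
    unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    Scratch* mine = pool + (size_t)tid * SCRATCH_ELEMS;

    for (unsigned int w = tid; w < numWork; w += stride) {
        mine[0].iter = w;           // placeholder for the real computation
        out[w] = mine[0].iter + 1;  // placeholder output
    }
}

// Host wrapper: the pool is sized by threads in flight, not by numWork.
// Returns true if every CUDA call succeeded.
bool runPool(unsigned int numWork, std::vector<unsigned int>& h_out)
{
    const unsigned int blocks = 3, threads = 32;  // 96 threads in flight
    const size_t poolBytes = sizeof(Scratch) * SCRATCH_ELEMS * blocks * threads;
    const size_t outBytes  = sizeof(unsigned int) * numWork;
    Scratch* d_pool = 0;
    unsigned int* d_out = 0;
    h_out.assign(numWork, 0);

    if (cudaMalloc((void**)&d_pool, poolBytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, outBytes) != cudaSuccess) {
        cudaFree(d_pool);
        return false;
    }

    poolKernel<<<blocks, threads>>>(d_pool, d_out, numWork);

    // The blocking copy also reports any error raised by the kernel.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, outBytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_pool);
    return err == cudaSuccess;
}
```

With this shape, 10,000 computations share 96 slots instead of needing 10,000. It does not handle the few computations that need more than the fixed amount.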
PROBLEM:
I’m trying to use shared memory arrays allocated at the start of a device function. Each computation will call this function exactly once from the global function.
In a host function, I would write:
#define DEFINEDSIZE
struct myStruct{
    int x;
    int y;
    unsigned int iter;
    myStruct* pointerToStruct;
};
void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    //arrays for local computation
    myStruct* array1[DEFINEDSIZE];
    myStruct array2[inputSize];
    //computations
    //output
    out[index].info = computedNumber;
}
I’ve tried several different ways of doing this in the device functions, and they work in EmuDebug, but when I run a Debug build the program crashes when I try to copy the output array (which was allocated with cudaMalloc and passed in to the kernel) back into the host’s memory. I’m fairly sure the memory is getting corrupted while it’s on the device, but I can’t figure out how.
__device__ void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    //arrays for local computation
    myStruct* array1[DEFINEDSIZE];
    myStruct array2[200];
    //computations
    //output
    out[index].info = array2[0].iter;
}
// I tried the below based on the documentation, but it doesn’t seem to work any better
extern __shared__ SearchElement d_states1[];
__device__ void func( unsigned int index, myOutputStruct* out, unsigned int inputSize){
    __shared__ myStruct* array1[DEFINEDSIZE];
    SearchElement* d_states0 = (SearchElement*)d_states1;
    SearchElement* d_states = (SearchElement*)&d_states0[200];
    //blah
}
Both of those print the same error,
“Cuda error in file ‘main.cu’ in line 239 : unspecified launch failure.”, and then the program immediately quits.
Line 239 is on the host:
CUDA_SAFE_CALL( cudaMemcpy( h_ouput, env.d_output, sizeof(myOutputStruct)*numberOfCalls, cudaMemcpyDeviceToHost));
where env.d_output is the out argument of func.
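For reference, kernel launches are asynchronous, so an error raised inside a kernel is normally reported by the next blocking runtime call. That would be consistent with the failure surfacing at the cudaMemcpy rather than at the launch. Below is a minimal sketch, under stated assumptions, of checking right after the launch. It also shows the extern __shared__ form, which takes its size from the third launch parameter. Elem, sharedKernel and launchAndCheck are hypothetical names, and cudaDeviceSynchronize is the current runtime call (older toolkits used cudaThreadSynchronize).

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Hypothetical element type standing in for SearchElement.
struct Elem { int x; int y; unsigned int iter; };

// Dynamically sized shared memory: declared as an unsized array; the byte
// count is supplied by the third launch parameter.
extern __shared__ Elem s_elems[];

__global__ void sharedKernel(unsigned int* out, unsigned int elemsPerThread)
{
    // Carve this thread's slice out of the block-wide shared array.
    Elem* mine = s_elems + threadIdx.x * elemsPerThread;
    mine[0].iter = threadIdx.x;  // placeholder for the real computation
    out[blockIdx.x * blockDim.x + threadIdx.x] = mine[0].iter;
}

// Launches the kernel and reports errors at the launch site instead of at
// a later cudaMemcpy. d_out must hold blocks * threads elements.
cudaError_t launchAndCheck(unsigned int* d_out, unsigned int blocks,
                           unsigned int threads, unsigned int elemsPerThread)
{
    size_t sharedBytes = sizeof(Elem) * elemsPerThread * threads;
    sharedKernel<<<blocks, threads, sharedBytes>>>(d_out, elemsPerThread);

    cudaError_t err = cudaGetLastError();  // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // errors raised while running
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Shared memory is allocated per block and is small (on the order of tens of kilobytes per block by default), so a per-thread slice of 200 structs may not fit for more than a few threads per block.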
Any help you guys can give would be appreciated; I’ve been stuck on this for a couple of weeks.
Thanks
Mark Henderson