I’m quite new to CUDA and this forum, but I’ll try to explain my question properly.
I have a couple of functions of the form

```cuda
__device__ double myfunction(arguments)
{
    double *tt = (double *)malloc(npoints * sizeof(double));
    ...
}
```

which are executed per-thread from the main kernel call.
Each of these functions also calls a few sub-functions; in total this amounts to 5 malloc() calls per thread, each for an array of ~50 elements.
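The structure is roughly like this (just a sketch to illustrate the pattern — `subfunction`, `npoints` and the loop body are placeholders, not my real code):

```cuda
#include <cstdio>

__device__ double subfunction(int npoints)
{
    // one of the ~5 per-thread allocations, each ~50 doubles
    double *buf = (double *)malloc(npoints * sizeof(double));
    double result = 0.0;
    for (int i = 0; i < npoints; ++i) {
        buf[i] = (double)i;   // placeholder work
        result += buf[i];
    }
    free(buf);                // freed before the function returns
    return result;
}

__global__ void mainKernel(double *out, int npoints)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread performs its own independent allocations
    out[tid] = subfunction(npoints);
}
```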
I run 256 simulations (MC-based) in parallel on the card, with grid configuration 1 block and 256 threads per block (so only one block is executed).
I get an out-of-memory error (I think so); in debug mode I see that some threads just get ‘stuck’ at the malloc() call and don’t finish.
Configured the other way (256 blocks, 1 thread per block), everything works fine.
This problem only appears when I use my written “extension” with the mallocs; the rest of the program runs fine in either case.
As I’ve read on this forum (there is hardly any documentation), malloc() allocates global memory. The overall memory problem is weird, as I have access to a Tesla C2050 card with 6 GB of memory, and the total amount I’m allocating is only about 2 KB per thread × 256 threads ≈ 500 KB.
Do you have any ideas why it might crash in one case but not the other?
Or ideas on how to trace the memory problem?
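One thing I could try myself is to check whether malloc() is actually failing. A sketch of what I have in mind (`checked_alloc` is a hypothetical helper; as far as I can tell, the in-kernel malloc heap defaults to only 8 MB unless raised from the host with cudaDeviceSetLimit()):

```cuda
#include <cstdio>

// Host side, before the kernel launch: optionally raise the
// device-side malloc heap (default is reportedly just 8 MB).
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

__device__ double *checked_alloc(int npoints)
{
    double *p = (double *)malloc(npoints * sizeof(double));
    if (p == NULL) {
        // device-side printf needs compute capability >= 2.0,
        // which the C2050 has
        printf("thread %d: malloc of %d doubles failed\n",
               blockIdx.x * blockDim.x + threadIdx.x, npoints);
    }
    return p;
}
```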