On-device malloc(), dependence on block_dim

Hi All,

I’m quite new to CUDA and to this forum, but I’ll try to explain my question properly.
I have a couple of functions of this form:

__device__ double myfunction(arguments)
{
<…>
double *tt = (double*)malloc(npoints * sizeof(double));
<.OPERATIONS.>
return …
}
These are executed per thread from the main kernel.
Each of these functions also calls a few sub-functions, amounting in total to 5 malloc() calls per thread, each for an array of ~50 elements.

Problem:
I run 256 simulations (MC-based) in parallel on the card.
With a grid configuration of 256 blocks & 256 threads per block (so only 1 block is executed), I seem to run out of memory: in debug mode I see that some threads just get ‘stuck’ at the malloc() call and don’t finish.

The other way, with 256 blocks and 1 thread per block, everything works fine.
This problem only appears when I use my own “extension” with the malloc() calls; the rest of the program runs fine in either case.

Question:
As I’ve read on this forum (there is hardly any documentation), malloc() allocates global memory. The out-of-memory problem is odd, since I have access to a Tesla 2050C card with 6GB of memory, and the total amount of memory I’m allocating per thread * 256 is ~500KB.
Do you have any ideas why it might be crashing in one case but not the other?
Or ideas on how to trace the memory problem?

Thanks,
Vytautas

Too little information to answer the question… need to see more code.

If you don’t also free() the memory inside the kernel, you are allocating memory for each thread in each block, i.e., 256× more memory than you expected. And you can’t (by default) allocate the whole 6GB, just the memory set aside for allocations from the device side. So you might need to call cudaDeviceSetLimit(cudaLimitMallocHeapSize, …) to make sure there is enough memory available on the device’s heap.
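
To put rough numbers on this: 5 arrays × 50 doubles × 8 bytes is about 2,000 bytes per thread, so 256 threads need about 500KB (matching the estimate in the question), while 256 blocks × 256 threads would need about 131MB if nothing is freed. As far as I know the default device heap is only 8MB, so the second figure cannot fit without raising the limit. Below is a minimal sketch of how this could be checked. The kernel, the helper names and the 256MB heap size are assumptions made up for illustration, not the original poster's code. Device-side malloc() returns NULL when the heap is exhausted, so counting NULL returns is one way to trace the problem.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-thread work described in the question:
// five scratch arrays of `npoints` doubles taken from the device heap.
__global__ void scratch_kernel(int npoints, int *failed)
{
    const int kArrays = 5;
    double *buf[kArrays];
    bool ok = true;

    for (int i = 0; i < kArrays; ++i) {
        buf[i] = (double *)malloc(npoints * sizeof(double));
        if (buf[i] == NULL) ok = false;  // NULL means the device heap is exhausted
    }

    if (ok) {
        // Placeholder for the real computation: just touch the memory.
        for (int i = 0; i < kArrays; ++i)
            for (int j = 0; j < npoints; ++j) buf[i][j] = 0.0;
    } else {
        atomicAdd(failed, 1);  // count threads that could not get their memory
    }

    // Release everything so later threads can reuse the heap; free(NULL) is ignored.
    for (int i = 0; i < kArrays; ++i) free(buf[i]);
}

// Launches the kernel and returns how many threads saw a failed malloc(),
// or -1 if a CUDA runtime call failed.
int count_failed_threads(int blocks, int threads, int npoints)
{
    int *d_failed = NULL;
    int h_failed = -1;

    if (cudaMalloc(&d_failed, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_failed, 0, sizeof(int));

    scratch_kernel<<<blocks, threads>>>(npoints, d_failed);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        cudaMemcpy(&h_failed, d_failed, sizeof(int), cudaMemcpyDeviceToHost);
    else
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_failed);
    return h_failed;
}

int main()
{
    // Assumed heap size: 256MB, roughly twice the 131MB worst case above to
    // leave room for allocator overhead. This must be set before the first
    // kernel that uses malloc()/free() is launched.
    size_t heap_bytes = (size_t)256 << 20;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int failed = count_failed_threads(256, 256, 50);
    printf("threads with a failed malloc(): %d\n", failed);
    return failed == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

If the cudaDeviceSetLimit call is commented out and the count becomes non-zero, that would point to the heap limit rather than total device memory. Code that dereferences the pointer without the NULL check would likely hang or crash at that point instead.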
