CUDA in-kernel malloc

I have narrowed the problem in my code down to the malloc calls in my kernel. They are not returning an error, but the values of other variables in the kernel are changing, which I suspect is memory corruption from using too much of the heap. I call cudaThreadGetLimit in my code, and it reports 8 MB. My kernel launch looks as follows:

dim3 dimGrid(100, 100);   // 100 x 100 = 10,000 blocks
dim3 dimBlock(1, 1);      // one thread per block
kernel<<<dimGrid, dimBlock>>>(…arguments…);

So I want 10,000 threads (I'm just trying to keep the code I'm working with simple). Inside the kernel there are two places where malloc is called. The first allocates two char sequences (at most 500 chars each) and a matrix of at most 500*500 ints. By my calculations that's about 1 MB per thread, which is less than the limit given by cudaThreadGetLimit. Am I looking at this incorrectly? Is that value telling me something different than I think? Does the 8 MB mean per thread, or is it the maximum memory that can be allocated by all threads together? Thanks for the help. I am a beginning CUDA programmer.
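For what it's worth, the heap that cudaThreadGetLimit reports for cudaLimitMallocHeapSize is a single device-wide pool shared by all threads, not a per-thread budget, and in-kernel malloc returns NULL once it is exhausted (writing through an unchecked NULL-ish pointer would explain the corruption). A minimal sketch of querying and raising the limit with the CUDA 4.0-era runtime API (the 128 MB figure here is arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t heapSize = 0;

    // Query the current device malloc heap limit (defaults to 8 MB).
    cudaThreadGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("malloc heap limit: %zu bytes\n", heapSize);

    // Raise it; this must be done before any kernel that calls malloc
    // is launched. Note that 10,000 threads * ~1 MB each would need
    // ~10 GB, which will not fit regardless; this only shows the
    // mechanism.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

    cudaThreadGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("new malloc heap limit: %zu bytes\n", heapSize);
    return 0;
}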

Do you deallocate the memory when the thread exits?

I have another question: can we malloc memory inside a kernel? Do you have an example of this?

See section B.15 of the CUDA 4.0 Programming Guide:

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf

If you’re using smallish amounts of memory per block in your kernel (16 KB or 48 KB of shared memory, depending on your card), it’s much better to use shared memory; in a nutshell, it’s faster. With that said, if you’re using CUDA 3.2 or higher, then you can use both malloc and free in device code.
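To answer the example question above: since CUDA 3.2, device code on compute capability 2.0 hardware can call malloc and free directly. A minimal sketch (the kernel name and sizes are my own, not from the SDK):

// Compile with: nvcc -arch=sm_20 (device malloc and printf need cc 2.0)
#include <cstdio>

// Each thread allocates a small buffer, uses it, and frees it.
__global__ void mallocDemo(int n)
{
    char *buf = (char *)malloc(n);
    if (buf == NULL) {
        printf("block %d: malloc failed, heap exhausted\n", blockIdx.x);
        return;
    }
    for (int i = 0; i < n; ++i)
        buf[i] = (char)i;
    // Free before the thread exits; otherwise the allocation lives on
    // the device heap for the lifetime of the CUDA context.
    free(buf);
}

int main()
{
    mallocDemo<<<100, 1>>>(256);   // 100 single-thread blocks
    cudaThreadSynchronize();
    return 0;
}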

I’d also advise checking out the SDK examples.