Heap Memory Allocation in Kernel Question


The document CUDA_C_Programming_Guide.pdf (page 135) describes how to allocate memory with malloc/free inside kernel (__global__) functions.

Example from there is:

```cuda
#include <cstdio>

__global__ void mallocTest()
{
    char* ptr = (char*)malloc(123);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaThreadSynchronize();
    return 0;
}
```



I use the same example, but compiling it fails with: error: calling a host function from a device/global function is not allowed

Since Toolkit version 3.2 it should be possible to allocate memory with malloc.

From the Toolkit 3.2 changelog: "Support for memory management using malloc() and free() in CUDA C compute kernels"

What am I doing wrong?

Do you have a compute capability 2.0 or 2.1 card (i.e. Fermi), and are you passing a code generation option to nvcc to build for compute capability 2.0 or 2.1?

I have the following devices.

Device 0: "Tesla C1060"
CUDA Capability Major/Minor version number: 1.3

Device 1: "GeForce 310M"
CUDA Capability Major/Minor version number: 1.2

No device with Capability 2.0 or higher.

Is there an alternative way to allocate memory in the kernel for compute capability < 2.0?
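A common workaround on pre-Fermi hardware is to cudaMalloc one large buffer from the host and let threads carve private chunks out of it with an atomic bump pointer. The sketch below illustrates that pattern; the names devAlloc/allocTest, the buffer size, and the 256-byte request are illustrative assumptions, not anything from the guide. Note that atomicAdd on global memory needs only compute capability 1.1, so this works on both of the cards listed above.

```cuda
// "Allocate" size bytes by atomically bumping a shared offset into a
// pre-allocated heap. Individual chunks are never freed; the whole
// buffer is released at once by the host with cudaFree.
__device__ void* devAlloc(char* heap, unsigned int* offset,
                          unsigned int size, unsigned int heapSize)
{
    unsigned int old = atomicAdd(offset, size);
    return (old + size <= heapSize) ? heap + old : NULL;  // NULL = exhausted
}

__global__ void allocTest(char* heap, unsigned int* offset,
                          unsigned int heapSize, char** results)
{
    // Each thread grabs a private 256-byte chunk. Device-side printf
    // also requires compute capability 2.0, so on a 1.x card the
    // pointers are written to memory for the host to inspect instead.
    results[threadIdx.x] = (char*)devAlloc(heap, offset, 256, heapSize);
}

int main()
{
    const unsigned int heapSize = 1024 * 1024;
    char* heap;        cudaMalloc((void**)&heap, heapSize);
    unsigned int* off; cudaMalloc((void**)&off, sizeof(unsigned int));
    cudaMemset(off, 0, sizeof(unsigned int));   // start allocating at byte 0
    char** results;    cudaMalloc((void**)&results, 5 * sizeof(char*));

    allocTest<<<1, 5>>>(heap, off, heapSize, results);
    cudaThreadSynchronize();

    cudaFree(results); cudaFree(off); cudaFree(heap);
    return 0;
}
```

The obvious limitation is that memory can only grow; if your kernels need real free(), you would have to layer a free-list on top, at which point moving to a 2.x card is usually the better answer.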


The compute capability of NVIDIA's Tesla C2050 is 2.0. Which option is needed to compile the code for capability 2.0?

If you don’t have a compute 2.x device (and you don’t), memory allocation in device code is not supported.

To compile for compute 2.0, pass -arch=sm_20 to nvcc.
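Assuming the source file is called mallocTest.cu (the file name is just an example), the build command would look like:

```shell
# Generate code for compute capability 2.0 (Fermi) devices
nvcc -arch=sm_20 -o mallocTest mallocTest.cu
```

Keep in mind that a binary built only for sm_20 will not run on the 1.x cards listed above; the kernel launch simply fails on those devices.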

OK, thanks for your time. I will try to compile my code this week on an NVIDIA Tesla C2050 card.