Dynamic memory allocation during kernel execution Is it posible?

Hi everybody!!

I know that memory allocation for device memory, has to be dynamically done from host using the cudaMalloc() function and freed with cudaFree(), and I am in the understanding that CUDA doesn’t let you do this from the device (i.e. in kernels), the question is that if you guys know if somebody have work on this, you know, constructing a “kernelCudaMalloc()” or something like that :ermm:. ?. that can be called from the device while running a kernel…

CUDA doesn’t support memory allocation from inside kernels and I don’t think it can easily be done. I haven’t seen any evidence that any of the current CUDA capable GPUs actually has what might be described as an autonomous MMU, and as a result the GPU is reliant the host driver to do at least some of the memory management functions you need to be able to have a kernel call malloc() or similar.

Since now malloc() for kernel has been released (I found no literature declaring it only works for 2.x devices) and it seems Tesla C1060 (device 1.3) has a MMU (see this page), I wonder if there is any possibility for C1060 to dynamically allocate memory in kernel or device functions, although I didn’t find any literature supporting this, yet.

I have tried malloc(), when compiling by nvcc with no options, it gave error:
calling a host function from a device/global function is not allowed

When compiling using option -arch=sm_20 or -arch=sm_21, no error occurred. However, the running output was wrong, which is predictable since the device is 1.3 not 2.x.

So it seems malloc() is only available for device 2.x (although I see no literature saying this). Is it because of hardware difference between devices 1.3 and 2.x? But someone told me there is no real difference, so dynamic allocation should be possible on 1.3 devices. Is he correct?

Just in case, below is my system settings.

OS: Ubuntu 10.04 (64-bit)
CUDA device: Tesla C1060 (x2)
Driver: 260.19.26
Toolkit: 3.2.16
SDK: 3.2.16

Device-side malloc() requires sm_2x. Due to an oversight, this is not being pointed out by the CUDA 3.2 documentation. This documentation defect was brought to our attention a couple of months and should be fixed in the next release.

Thanks for reply. Anyone knows the answer regarding to possibility and/or hardware difference?

So, if I want to use the device-side malloc(), do I need to add the device, or just use malloc in the kernel?

Just include stdlib.h and call malloc. If the code generation options are correct, nvcc will overload the call with the device version of the function automagically.

As for hardware differences, it worth noting that my original reply in this thread was written 3 months before Fermi was announced. At the time of the announcement, NVIDIA made a big deal about redesigning the GPU memory controller to provide a unified “flat” address space that would allow device code malloc and the C++ new operator to work. This implies that GT200 memory controller wasn’t a full MMU in the conventional sense. The Fermi implementation of malloc still requires a host driver managed heap that must be sized and allocated before the kernel runs. That tells me that even in Fermi, the MMU isn’t completely autonomous and some kind of host driver managed memory map is still used.

The same by me. No error occurs, by compiling my code. But the execution of the executable goes wrong.

How about this?

Hi. Has anyone found a solution to this? I have the same problem when running my code, compiles fine with a malloc inside the kernel, no errors, but causes unexpected behaviour at runtime, even when the kernel containing the malloc is not executed . . .

Ta, Pete

Btw, I’m running on a Tesla C2070 on 64bit ubuntu

I was able to malloc the data in kernel and transfer back to host and check it.

#include “stdlib.h”
#define ALLOC_SIZE 128
global void
test_malloc(int **controller)
shared int ptr;
int bx = blockIdx.x;
if (threadIdx.x == 0)
ptr = (int
)malloc(ALLOC_SIZE * sizeof(int));
controller[bx] = ptr;
for (int idx = threadIdx.x; idx < ALLOC_SIZE; idx += blockDim.x)
ptr[idx] = threadIdx.x;

global void
test_free(int **controller)
int bx = blockIdx.x;
if(threadIdx.x == 0)

void runTest()
int block_num = 64;
int block_size = 32;
int **g_controller;
cudaMalloc(&g_controller, sizeof(int *) *block_num);
test_malloc<<<block_num, block_size>>>(g_controller);
int *h_controller[block_num];
cudaMemcpy(h_controller, g_controller, sizeof(int *) *block_num, cudaMemcpyDeviceToHost);
for (int i = 0; i != block_size; i++)
printf(“allocated pointer %p.\n”, h_controller[i]);
int buffer[ALLOC_SIZE];
cudaMemcpy(buffer, h_controller[i], sizeof(int) * ALLOC_SIZE, cudaMemcpyDeviceToHost);
for (int index = 0; index != block_size; index++)
assert(buffer[index] == index);

  test_free<<<block_num, block_size>>>(g_controller);


I try to compile the above code. But it produce compile time errors:

test.cu(19): error: calling a host function from a device/global function is only allowed in device emulation mode

test.cu(51): error: calling a host function from a device/global function is only allowed in device emulation mode

Can you please help me running this code?

How did you compile it? Specifically, for what compute capability? It sounds like you were targetting one that’s too old, you need 2.x or more recent for dynamic allocation on the device

Device-side malloc() requires compute capability 2.0 or higher (i.e. a Fermi or Kepler class GPU). BTW, you may want to update your CUDA version. The mention of device emulation in the quoted error message suggests you are using a CUDA version prior to 3.1. The current CUDA version is 5.0.