Question: Dynamic memory allocation in a CUDA kernel function


I am new to CUDA and would appreciate it if you could take some time to answer my query. Am I allowed to call malloc in a CUDA kernel function?

I need to dynamically allocate memory in my device code. I am using cudaMalloc and cudaMemcpy in my host code so that the host can transfer the input to the device, and I am also allocating memory for the output from my host code using cudaMalloc. But how do I allocate memory for the "working space" each thread needs?
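For context, the host-side pattern described here looks roughly like the following sketch (kernel and variable names are illustrative, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder computation
}

int main(void) {
    const int n = 1024;
    float h_in[1024], h_out[1024];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));   // input buffer on the device
    cudaMalloc(&d_out, n * sizeof(float));   // output buffer on the device
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```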

How do I dynamically allocate per-thread local memory?


At the moment, you can’t. CUDA doesn’t support dynamic memory allocation inside kernels. Only host-side code can dynamically allocate device memory, and even then only global memory (the size of dynamically sized shared memory is likewise fixed from the host, as a kernel launch parameter).
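The simplest way around this is static partitioning: from the host, allocate one large global buffer sized at (number of threads) × (per-thread workspace), and let each thread index its own fixed slice. A sketch, with the workspace size chosen arbitrarily for illustration:

```cuda
#include <cuda_runtime.h>

#define WORKSPACE_PER_THREAD 64   // floats of scratch space per thread (example value)

__global__ void kernelWithScratch(float *scratch, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // Each thread owns a disjoint slice, so no synchronization is needed.
    float *my = scratch + tid * WORKSPACE_PER_THREAD;
    for (int i = 0; i < WORKSPACE_PER_THREAD; ++i)
        my[i] = 0.0f;   // use the slice as private working space
}

int main(void) {
    const int n = 4096, threadsPerBlock = 256;
    float *d_scratch;
    // One allocation covers every thread's workspace.
    cudaMalloc(&d_scratch, (size_t)n * WORKSPACE_PER_THREAD * sizeof(float));
    kernelWithScratch<<<(n + threadsPerBlock - 1) / threadsPerBlock,
                        threadsPerBlock>>>(d_scratch, n);
    cudaDeviceSynchronize();
    cudaFree(d_scratch);
    return 0;
}
```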

You could allocate a large pool of device memory on the host and then write your own device-side malloc and free functions that hand out pieces of that pool. You would have to deal with concurrent access from multiple threads, but that should be possible using atomics or by statically partitioning the pool.
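A minimal device-side "bump" allocator along those lines might look like this. It is only a sketch: there is no free (the whole pool is reset between launches by zeroing the offset), alignment is a fixed 8 bytes, and exhaustion is reported by returning NULL. All names here are invented for the example:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Pool state lives in device globals; the host allocates the pool itself.
__device__ char        *g_pool;       // base of the pre-allocated pool
__device__ unsigned int g_offset;     // next free byte, advanced atomically
__device__ unsigned int g_poolSize;   // total pool size in bytes

// "malloc" out of the pool: atomically bump the offset.
__device__ void *poolMalloc(unsigned int bytes) {
    bytes = (bytes + 7u) & ~7u;                    // 8-byte align each request
    unsigned int old = atomicAdd(&g_offset, bytes);
    if (old + bytes > g_poolSize) return NULL;     // pool exhausted
    return g_pool + old;
}

__global__ void demo(void) {
    float *scratch = (float *)poolMalloc(32 * sizeof(float));
    if (scratch)
        scratch[0] = (float)threadIdx.x;   // private working space
}

int main(void) {
    const unsigned int poolSize = 1 << 20;   // 1 MiB pool
    char *pool;
    cudaMalloc(&pool, poolSize);

    // Publish the pool pointer, size, and a zeroed offset to the device globals.
    unsigned int zero = 0;
    cudaMemcpyToSymbol(g_pool, &pool, sizeof(pool));
    cudaMemcpyToSymbol(g_offset, &zero, sizeof(zero));
    cudaMemcpyToSymbol(g_poolSize, &poolSize, sizeof(poolSize));

    demo<<<4, 64>>>();
    cudaDeviceSynchronize();
    cudaFree(pool);
    return 0;
}
```

A bump allocator like this cannot reclaim individual blocks; if you need true free, you would have to maintain a free list protected by atomics, which is considerably more involved.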

Personally, I’m not fond of this design practice, but you could try it out.