How can we allocate memory dynamically in __device__ functions?

Hi,

I have a problem with allocating memory dynamically in a device function.
See the code below:

__device__ void Device_Fun( args… )
{
    // Here I want to allocate a LARGE amount of memory dynamically
}

__global__ void Kernel_Fun( args… )
{
    // calling the device function…
    Device_Fun( args… );
}

// Kernel call…
Kernel_Fun<<< 16, 256 >>>( args… );

I already used

extern __shared__ char chArray[];

and used it inside the device function, but it only works for small amounts of memory.
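
For reference, here is a minimal sketch of the extern __shared__ pattern, where the buffer size is fixed at launch time via the third <<<>>> parameter (the kernel name, slice size, and argument are assumptions for illustration). This is why it only scales to small sizes: shared memory is limited per block (16 KB on early hardware).

```cuda
// Dynamically sized shared memory, declared outside the kernel
extern __shared__ char chArray[];

__global__ void Shared_Demo(int bytesPerThread)
{
    // each thread takes a private slice of the shared array
    char *mySlice = chArray + threadIdx.x * bytesPerThread;
    // ... work with mySlice ...
}

// Host side: request 256 threads * 4 bytes of shared memory per block
// Shared_Demo<<<16, 256, 256 * 4>>>(4);
```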

  • So what is the optimal way to allocate dynamic memory in a device function?
  • What is the maximum amount of memory we can allocate dynamically in a device function?

Please help me!


I believe the only way to do it is to allocate a large block on the host using cudaMalloc, pass it to the kernel, and then manually carve out chunks for use within your code. With the potential synchronization issues on device memory, this is not a trivial task.
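
A minimal sketch of that carving approach, assuming each thread gets a fixed-size private chunk (the names Device_Fun, Kernel_Fun, and CHUNK_BYTES are placeholders, and no cross-thread synchronization is needed because the chunks don't overlap):

```cuda
#define CHUNK_BYTES 1024  // assumed per-thread scratch size

__device__ void Device_Fun(char *myChunk)
{
    // use myChunk as this thread's private scratch space
}

__global__ void Kernel_Fun(char *heap)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread carves out its own non-overlapping chunk
    Device_Fun(heap + tid * CHUNK_BYTES);
}

// Host side:
// char *d_heap;
// cudaMalloc((void **)&d_heap, 16 * 256 * CHUNK_BYTES);  // one chunk per thread
// Kernel_Fun<<<16, 256>>>(d_heap);
// cudaFree(d_heap);
```

A true device-side heap (variable-size allocation, freeing, reuse) would need atomic bookkeeping on top of this, which is where the synchronization difficulty comes in.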

It would be nice if someone had already solved this, and published some basic CUDA heap management routines…

I don’t think that you can allocate memory inside a device function; that is to say, there’s no support for a C++ style “MyObject *o = new MyObject();” sort of allocation.

Also, where (global, shared, constant) are you trying to allocate this memory, and why do you need to do it from the device code?