How to create a dynamically sized array on the device?

Hi all,

I have a question: how can I create a dynamically sized array inside a kernel function in device-side code? What I want is an array in thread-local memory.

I tried to use cudaMalloc(…), but the compiler says:
“calling a host function from a device/global function is only allowed in device emulation mode.”

Can I do this kind of thread-local memory allocation in device-side code?
If not, is there any way to allocate thread-local memory from host-side code?

Thanks for looking!


In short: no, you can’t.

From the host you can only allocate global memory, and the amount of local or shared memory used by a thread is fixed by the time the kernel is launched; it cannot be changed from inside the kernel. See the Programming Guide for details on the different types of memory and what you can and cannot do with them.

You can allocate shared memory dynamically, but only at kernel launch time, not from within the kernel itself. See sections and 4.2.3 in the Programming Guide (1.1).
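
To make the launch-time allocation concrete, here is a minimal sketch. The kernel name `sumSlices`, the buffer name `scratch`, and the sizes are made up for illustration; the `extern __shared__` declaration and the third launch parameter are the actual mechanism. Each thread carves its own slice out of the block's shared buffer, which only works while threads * perThread floats fit in the shared memory available per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Size is not known here; it comes from the third launch parameter.
extern __shared__ float scratch[];

// Hypothetical kernel: each thread uses a private slice of `perThread` floats.
__global__ void sumSlices(float *out, int perThread)
{
    float *mine = scratch + threadIdx.x * perThread;

    for (int i = 0; i < perThread; ++i)
        mine[i] = (float)(threadIdx.x + i);

    float sum = 0.0f;
    for (int i = 0; i < perThread; ++i)
        sum += mine[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main()
{
    const int blocks = 2, threads = 64;
    int perThread = 8;  // chosen on the host at run time
    size_t sharedBytes = threads * perThread * sizeof(float);

    float *dOut = 0;
    cudaMalloc((void **)&dOut, blocks * threads * sizeof(float));

    // Third parameter: bytes of dynamically sized shared memory per block.
    sumSlices<<<blocks, threads, sharedBytes>>>(dOut, perThread);

    float hOut[blocks * threads];
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    // Recompute each thread's sum on the host and compare.
    int bad = 0;
    for (int i = 0; i < blocks * threads; ++i) {
        int t = i % threads;
        float expected = 0.0f;
        for (int k = 0; k < perThread; ++k)
            expected += (float)(t + k);
        if (hOut[i] != expected)
            ++bad;
    }
    printf("%s\n", bad ? "mismatch" : "ok");
    return bad;
}
```

Note that the slice layout used here (thread t starts at t * perThread) is the simplest one to read, not necessarily the fastest; it can cause shared memory bank conflicts.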

Thanks guys. So maybe I can’t do that per thread. Even shared memory is accessible to all threads in a block, according to the specification.

It’s trivial to allocate local memory on the device or host. You just have to do it yourself :) E.g., you can allocate a bunch of local memory upfront and then write a simple allocator to use inside the kernel. Or you can allocate a dynamic amount of global memory on the device and then index into it by threadIdx (and you can even use C++ operator overloading to make it all transparent).
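
A rough sketch of the second option (global memory indexed per thread, wrapped in an overloaded `operator[]`). Everything named here (`ThreadArray`, `sumSquares`, the sizes) is hypothetical and only for illustration. I've assumed an interleaved layout, where element i of thread tid lives at pool[i * totalThreads + tid], so that neighbouring threads touch neighbouring addresses; a plain contiguous slice per thread would also work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: makes a strided slice of global memory look like
// an ordinary per-thread array.
struct ThreadArray {
    float *base;   // first element owned by this thread
    int stride;    // total number of threads in the grid

    __device__ float &operator[](int i) { return base[(size_t)i * stride]; }
};

// Each thread fills its n-element "array" with squares and sums them.
__global__ void sumSquares(float *pool, int n, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;

    ThreadArray a = { pool + tid, total };

    for (int i = 0; i < n; ++i)
        a[i] = (float)(i * i);

    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i];

    out[tid] = sum + (float)tid;
}

int main()
{
    const int blocks = 4, threads = 64, total = blocks * threads;
    int n = 10;  // per-thread array length, chosen at run time

    float *dPool = 0, *dOut = 0;
    cudaMalloc((void **)&dPool, (size_t)total * n * sizeof(float));
    cudaMalloc((void **)&dOut, total * sizeof(float));

    sumSquares<<<blocks, threads>>>(dPool, n, dOut);

    float hOut[total];
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dPool);
    cudaFree(dOut);

    // 0^2 + 1^2 + ... + 9^2 = 285, plus the thread index.
    int bad = 0;
    for (int t = 0; t < total; ++t)
        if (hOut[t] != 285.0f + (float)t)
            ++bad;
    printf("%s\n", bad ? "mismatch" : "ok");
    return bad;
}
```

The pool is sized on the host as (number of threads) x (elements per thread), so the "dynamic" part happens before the launch, not inside the kernel.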

So, how do you allocate local memory upfront?

I meant statically. Just get a big ol' chunk. Unfortunately, as I wrote in the other thread, that can kill performance if you get into the kilobytes for some unfathomable reason.
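
Something along these lines, as a sketch only. `LocalPool`, `POOL_FLOATS`, and `useTwoArrays` are names I made up; the idea is a fixed-size per-thread array plus a trivial bump allocator that hands out pieces of it and returns a null pointer when the chunk is exhausted. An array indexed like this will most likely be placed in local memory by the compiler, which is where the performance hit comes from.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define POOL_FLOATS 64  // per-thread capacity, fixed at compile time

// Hypothetical per-thread bump allocator over a statically sized chunk.
struct LocalPool {
    float data[POOL_FLOATS];
    int used;

    __device__ float *alloc(int n)
    {
        if (used + n > POOL_FLOATS)
            return 0;  // out of space
        float *p = data + used;
        used += n;
        return p;
    }
};

// Each thread "allocates" two arrays of run-time length n from its own pool.
__global__ void useTwoArrays(int n, int *status, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    LocalPool pool;
    pool.used = 0;

    float *a = pool.alloc(n);
    float *b = pool.alloc(n);
    if (!a || !b) {
        status[tid] = 1;  // request did not fit in the static chunk
        out[tid] = 0.0f;
        return;
    }

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
        sum += a[i] + b[i];
    }
    status[tid] = 0;
    out[tid] = sum;
}

int main()
{
    const int threads = 32;
    int *dStatus = 0;
    float *dOut = 0;
    cudaMalloc((void **)&dStatus, threads * sizeof(int));
    cudaMalloc((void **)&dOut, threads * sizeof(float));

    int hStatus[threads];
    float hOut[threads];
    int bad = 0;

    // n = 16: two arrays need 32 floats, which fits. Sum is 3 * (0 + ... + 15) = 360.
    useTwoArrays<<<1, threads>>>(16, dStatus, dOut);
    cudaMemcpy(hStatus, dStatus, sizeof(hStatus), cudaMemcpyDeviceToHost);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    for (int t = 0; t < threads; ++t)
        if (hStatus[t] != 0 || hOut[t] != 360.0f)
            ++bad;

    // n = 40: the second request would need 80 floats in total, so it fails.
    useTwoArrays<<<1, threads>>>(40, dStatus, dOut);
    cudaMemcpy(hStatus, dStatus, sizeof(hStatus), cudaMemcpyDeviceToHost);
    for (int t = 0; t < threads; ++t)
        if (hStatus[t] != 1)
            ++bad;

    cudaFree(dStatus);
    cudaFree(dOut);
    printf("%s\n", bad ? "mismatch" : "ok");
    return bad;
}
```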