I believe the only way to do it is to allocate a large block on the host using cudaMalloc, pass the pointer to your kernel, and then manually carve out chunks for use within your device code. Given the potential synchronization issues when many threads allocate from the same block, this is not a trivial task.
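Here's a minimal sketch of what that carve-out might look like. The helper names (HeapState, device_heap_alloc, init_heap) and the 1 MB block size are my own illustrative choices, not an established API; a global offset counter bumped with atomicAdd sidesteps the synchronization issue, at the cost of never being able to free individual chunks:

```cuda
#include <cstddef>

struct HeapState {
    char*        base;   // start of the big block from cudaMalloc
    size_t       size;   // total bytes available
    unsigned int offset; // bytes handed out so far
};

__device__ HeapState heap_state;

// Carve `bytes` out of the pre-allocated block; returns NULL when full.
__device__ void* device_heap_alloc(size_t bytes)
{
    unsigned int old = atomicAdd(&heap_state.offset, (unsigned int)bytes);
    if (old + bytes > heap_state.size)
        return NULL; // out of space (the bump is never rolled back)
    return heap_state.base + old;
}

__global__ void init_heap(char* base, size_t size)
{
    heap_state.base   = base;
    heap_state.size   = size;
    heap_state.offset = 0;
}

__global__ void use_heap(void)
{
    // Each thread carves out its own 64-byte scratch chunk.
    int* chunk = (int*)device_heap_alloc(64);
    if (chunk)
        chunk[0] = threadIdx.x;
}

int main(void)
{
    const size_t heap_bytes = 1 << 20; // 1 MB block, chosen arbitrarily
    char* d_block;
    cudaMalloc((void**)&d_block, heap_bytes);
    init_heap<<<1, 1>>>(d_block, heap_bytes);
    use_heap<<<4, 64>>>();
    cudaDeviceSynchronize();
    cudaFree(d_block);
    return 0;
}
```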
It would be nice if someone had already solved this, and published some basic CUDA heap management routines…
I don’t think you can allocate memory inside a device function; that is to say, there’s no support for a C++-style “MyObject* o = new MyObject();” sort of allocation.
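To illustrate the usual substitute: the host pre-allocates a pool of objects with cudaMalloc, and each thread initializes its own slot by hand instead of calling new. This is only a sketch; the names (MyObject, init_objects) and the one-object-per-thread layout are assumptions for the example:

```cuda
struct MyObject { int x; };

__global__ void init_objects(MyObject* pool)
{
    // MyObject* o = new MyObject();   // no device-side operator new
    MyObject* o = &pool[blockIdx.x * blockDim.x + threadIdx.x];
    o->x = 0;                          // "construct" the object by hand
}

int main(void)
{
    MyObject* d_pool;
    cudaMalloc((void**)&d_pool, 256 * sizeof(MyObject)); // 4 * 64 threads
    init_objects<<<4, 64>>>(d_pool);
    cudaDeviceSynchronize();
    cudaFree(d_pool);
    return 0;
}
```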
Also, in which memory space (global, shared, constant) are you trying to allocate, and why do you need to do it from device code?