Dynamic memory allocation in a kernel?


Is there a way to allocate memory dynamically inside a kernel function?

__global__ void myKernel() {
	float* a;
	cudaMalloc((void**)&a, length * sizeof(float));  // this call triggers the error
}

This yields the following error: Calls are not supported (found non-inlined call to cudaMalloc)

The same happens with malloc. Am I trying to do something stupid?

I worked around the problem for now by allocating everything on the device from the host (cudaMalloc, cudaMemcpy) and passing the pointers to the kernel. That worked, but creating the structures for all n threads on the host and copying them to the device is a bit unsatisfactory.

Any suggestions?



I believe you cannot allocate memory from inside kernel code with the memory routines. What you can do is allocate shared memory for each thread block when the kernel is invoked (prior to its execution). So it's a compromise as far as dynamic allocation goes: you must know how much memory each thread (or, more precisely, all threads of a block together) needs, and specify that size in bytes as the third argument of the <<<...>>> execution configuration in the call to the kernel (check the programming guide section on invoking kernels from host code).

In the kernel, you declare the shared array along these lines:


extern __shared__ float p[];


There are more details on this in the documentation. I think it will solve your performance problem (at least partially), since your threads can now copy the data from global memory into their block's shared memory, which is much faster to access.
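To make the launch side concrete, here is a minimal sketch of how the pieces could fit together. The kernel name reverseInBlock, the helper runDemo and the sizes are made up for illustration, and the kernel assumes n is a multiple of the block size. Each block copies its slice from global memory into the dynamically sized shared array, then writes it back reversed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: assumes n is a multiple of blockDim.x.
__global__ void reverseInBlock(const float* in, float* out, int n)
{
    // Size is unknown at compile time; it is set by the third launch argument.
    extern __shared__ float p[];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage this block's slice of global memory in shared memory.
    if (gid < n) p[threadIdx.x] = in[gid];
    __syncthreads();  // wait until the whole slice has been loaded

    // Read a different thread's element from the shared buffer.
    if (gid < n) out[gid] = p[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical host helper: returns true if the result matches expectations.
bool runDemo()
{
    const int n = 256, threads = 64, blocks = n / threads;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void**)&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    // Shared memory is requested per block, in bytes.
    size_t sharedBytes = threads * sizeof(float);
    reverseInBlock<<<blocks, threads, sharedBytes>>>(dIn, dOut, n);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    // Element i should hold the mirrored element of its own block.
    for (int i = 0; i < n; ++i) {
        int blockStart = (i / threads) * threads;
        float expected = (float)(blockStart + threads - 1 - (i % threads));
        if (hOut[i] != expected) return false;
    }
    return true;
}

int main()
{
    bool ok = runDemo();
    printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Note that the size is fixed for the whole launch, so this only helps when you can compute an upper bound per block on the host before calling the kernel.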