cudaMalloc from inside a kernel


Is it ok to call cudaMalloc from inside a kernel? I need to allocate memory for each of my kernel threads, and I was wondering if it is ok to use cudaMalloc or if there is a better/faster way.

I would normally just give this a try but the problem is I am working on a PC that does not have a CUDA enabled card :(

Many thanks,


Kernels can’t dynamically allocate memory. All of the CUDA runtime API functions are host functions only.

No, you cannot call cudaMalloc inside any kernel.

Just allocate device memory from host code.

The following code comes from the programming guide:

[codebox]// Device code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);

    // Copy vectors from host memory to device memory
    // h_A and h_B are input vectors stored in host memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}[/codebox]

Thanks for the reply.

This is a problem though. I have the following:


bool * myArray = (bool *)malloc(someSize);

for (int i = 0; i < sizeX; ++i)
{
    for (int j = 0; j < sizeY; ++j)
    {
        for (int k = 0; k < sizeZ; ++k)
        {
            // Some processing
            memset(myArray, 0, totalSize*sizeof(bool));
        }
    }
}

Now, this does not translate easily into the kernel, unless each thread has access to some exclusive memory. I guess I have to create one massive array and give each thread an offset into it…