Is it OK to call cudaMalloc from inside a kernel? I need to allocate memory for each of my kernel threads, and I was wondering whether it is OK to use cudaMalloc or if there is a better/faster way.
I would normally just give this a try, but the problem is I am working on a PC that does not have a CUDA-enabled card :(
[codebox]	for (int k = 0; k < sizeZ; ++k)
	{
		// Some processing
		memset(myArray, 0, totalSize*sizeof(bool));
	}
}
[/codebox]
Now, this does not translate easily into a kernel unless each thread has access to some exclusive memory. I guess I have to create one massive array and give each thread an offset into it…
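In case it helps, here is a minimal sketch of that offset idea (the kernel name, launch configuration, and host-side variables are made up for illustration): allocate one big array with a single host-side cudaMalloc, and have each thread compute a pointer to its own exclusive slice.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread works on its own slice of one big array.
__global__ void processAll(bool *bigArray, int totalSize, int sizeZ)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // This thread's exclusive region: totalSize bools starting at tid*totalSize.
    bool *myArray = bigArray + (size_t)tid * totalSize;

    for (int k = 0; k < sizeZ; ++k)
    {
        // Some processing
        memset(myArray, 0, totalSize * sizeof(bool));  // memset is callable in device code
    }
}

int main()
{
    // Assumed sizes, just for the sketch.
    int blocks = 4, threadsPerBlock = 256;
    int totalSize = 128, sizeZ = 10;
    size_t numThreads = (size_t)blocks * threadsPerBlock;

    bool *bigArray;
    cudaMalloc(&bigArray, numThreads * totalSize * sizeof(bool));  // one allocation for all threads

    processAll<<<blocks, threadsPerBlock>>>(bigArray, totalSize, sizeZ);
    cudaDeviceSynchronize();

    cudaFree(bigArray);
    return 0;
}
```

The point of doing it this way is that there is exactly one allocation, made on the host before the launch, so no thread ever has to allocate anything while the kernel is running; each thread just does a bit of pointer arithmetic to find its slice.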