How can I create a dynamically sized array inside a kernel function (device-side code)? What I want is an array in thread-local memory.
I tried calling cudaMalloc(…) from inside the kernel, but the compiler says:
“calling a host function from a device/global function is only allowed in device emulation mode.”
Is there a way to do this kind of thread-local memory allocation in device-side code?
If not, is there any way to allocate thread-local memory from host-side code?
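To make the host-side idea concrete, here is a rough sketch of the kind of fallback I have in mind. It is only an assumption on my part: the host allocates one global-memory buffer and each thread uses its own slice of it, so this is not true thread-local memory. The names kernel, scratch, perThread, and out are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread treats its own slice of `scratch`
// as a private array of `perThread` ints (global memory, not local memory).
__global__ void kernel(int *scratch, int perThread, int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *mine = scratch + (size_t)tid * perThread;  // this thread's slice

    int sum = 0;
    for (int i = 0; i < perThread; ++i) {
        mine[i] = tid + i;   // use the slice like a per-thread array
        sum += mine[i];
    }
    out[tid] = sum;
}

int main()
{
    const int blocks = 2, threads = 64;
    const int total = blocks * threads;
    int perThread = 10;      // assumed size; would be chosen at run time

    int *scratch = NULL, *out = NULL;
    // Host-side allocation: room for `perThread` ints for every thread.
    if (cudaMalloc((void **)&scratch, sizeof(int) * total * perThread) != cudaSuccess ||
        cudaMalloc((void **)&out, sizeof(int) * total) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    kernel<<<blocks, threads>>>(scratch, perThread, out);

    int host[total];
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host, out, sizeof(host), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("thread 0 sum = %d, thread 1 sum = %d\n", host[0], host[1]);

    cudaFree(scratch);
    cudaFree(out);
    return 0;
}
```

The obvious drawback is that the buffer has to be sized for every thread up front, which is why I would still prefer a real per-thread allocation if one exists.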
Thanks for looking!