Is it in any way possible to make an array for each thread where I don’t know the size until I make the kernel call?
Simple answer: no.
But if the maximum possible size isn’t huge, you could allocate a fixed-size buffer per thread, use as much of it as you need, and waste the rest. That’s likely fine if you have something like a per-thread stack that never gets deeper than some limit like 1000.
You could do this in local memory or manually in device memory.
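Here is a minimal sketch of the fixed-buffer idea, with the buffer declared as a plain per-thread array that the compiler places in registers or local memory. The names `MAX_DEPTH`, `stack_demo` and `run_stack_demo` are made up for illustration, and the bound of 64 is an assumed limit, not something from your problem.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed compile-time upper bound on the per-thread stack depth.
constexpr int MAX_DEPTH = 64;

// Each thread owns a fixed-size stack and uses only the part it needs.
// As a toy workload, thread t pushes 1..depth[t] and then pops and sums them.
__global__ void stack_demo(const int *depth, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int stack[MAX_DEPTH];                    // per-thread storage
    int top = 0;
    int want = min(depth[tid], MAX_DEPTH);   // never exceed the fixed bound

    for (int i = 1; i <= want; ++i) stack[top++] = i;   // push
    int sum = 0;
    while (top > 0) sum += stack[--top];                // pop
    out[tid] = sum;
}

// Hypothetical host helper: runs the kernel with one thread per entry.
std::vector<int> run_stack_demo(const std::vector<int> &depths)
{
    int n = static_cast<int>(depths.size());
    std::vector<int> out(n);
    int *d_depths = nullptr, *d_out = nullptr;
    cudaMalloc(&d_depths, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_depths, depths.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    stack_demo<<<(n + 127) / 128, 128>>>(d_depths, d_out, n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_depths);
    cudaFree(d_out);
    return out;
}
```

Requests deeper than `MAX_DEPTH` are simply clamped here; in real code you would probably want to flag that case as an error instead.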
A fancier option is to write your own memory allocator. It gets ugly quickly, though it’s not too bad if you don’t need to worry about freeing memory.
The simplest alloc-only version is a single global atomic index. You allocate one big chunk of device memory, 100MB or whatever. Whenever a thread wants some private memory, it does an atomicAdd on the index (by however much it needs at runtime), and the thread “owns” the range of addresses starting at the offset the atomicAdd returns.
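A rough sketch of that alloc-only scheme could look like the following. Everything named here (`g_pool_used`, `bump_alloc`, `bump_demo`, `run_bump_demo`) is hypothetical, and the 16-byte rounding is just an assumption to keep returned pointers aligned.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Bytes handed out so far; this is the single global atomic index.
__device__ unsigned long long g_pool_used = 0;

// Alloc-only "bump" allocator: reserve a range by advancing the index.
// Returns nullptr when the pool is exhausted. Nothing is ever freed.
__device__ void *bump_alloc(char *pool, unsigned long long pool_bytes,
                            unsigned long long bytes)
{
    bytes = (bytes + 15ull) & ~15ull;   // round up to 16 bytes for alignment
    unsigned long long offset = atomicAdd(&g_pool_used, bytes);
    if (offset + bytes > pool_bytes) return nullptr;
    return pool + offset;
}

// Toy workload: thread t asks for counts[t] ints, fills them with t and
// sums them back. A thread whose request fails writes -1.
__global__ void bump_demo(char *pool, unsigned long long pool_bytes,
                          const int *counts, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int *mine = static_cast<int *>(
        bump_alloc(pool, pool_bytes, counts[tid] * sizeof(int)));
    if (mine == nullptr) { out[tid] = -1; return; }

    for (int i = 0; i < counts[tid]; ++i) mine[i] = tid;
    int sum = 0;
    for (int i = 0; i < counts[tid]; ++i) sum += mine[i];
    out[tid] = sum;
}

// Hypothetical host helper: allocates the pool, resets the index, runs once.
std::vector<int> run_bump_demo(const std::vector<int> &counts,
                               unsigned long long pool_bytes)
{
    int n = static_cast<int>(counts.size());
    std::vector<int> out(n);
    char *d_pool = nullptr;
    int *d_counts = nullptr, *d_out = nullptr;
    cudaMalloc(&d_pool, pool_bytes);
    cudaMalloc(&d_counts, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_counts, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned long long zero = 0;        // start each run with an empty pool
    cudaMemcpyToSymbol(g_pool_used, &zero, sizeof(zero));

    bump_demo<<<(n + 127) / 128, 128>>>(d_pool, pool_bytes, d_counts, d_out, n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_pool);
    cudaFree(d_counts);
    cudaFree(d_out);
    return out;
}
```

Note that the index keeps advancing even when a request fails, so once the pool is full every later request fails too. Resetting the index between kernel launches, as the host helper does, is the only way to reclaim memory in this scheme.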
If you start playing games trying to support dynamic FREEING of memory, you’ve opened the door to the unpleasantness. It’s possible but not elegant, efficient, or simple. I used a global device-wide lock. It worked but it’s just not efficient. In my case the efficiency wasn’t important because usually I just needed one alloc per block so the slow speed didn’t hurt me.
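To give a flavor of the lock-based approach, here is one possible sketch. It is not the allocator described above, just an illustration under simplifying assumptions: all chunks have the same fixed size, a free list of chunk indices is guarded by a device-wide spinlock, and only one thread per block ever takes the lock (several threads of one warp spinning on the same lock can hang on some GPUs). All names are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int NUM_CHUNKS = 8;     // assumed pool size, in chunks
constexpr int CHUNK_INTS = 256;   // assumed chunk size, in ints

__device__ int g_lock = 0;                        // 0 = free, 1 = held
__device__ volatile int g_free_list[NUM_CHUNKS];  // stack of free chunk ids
__device__ volatile int g_free_count = 0;

__device__ void pool_lock()
{
    while (atomicCAS(&g_lock, 0, 1) != 0) { }     // spin until acquired
    __threadfence();
}

__device__ void pool_unlock()
{
    __threadfence();                              // publish our writes first
    atomicExch(&g_lock, 0);
}

// Returns a chunk id, or -1 if no chunk is free.
__device__ int chunk_alloc()
{
    int id = -1;
    pool_lock();
    int n = g_free_count;
    if (n > 0) { id = g_free_list[n - 1]; g_free_count = n - 1; }
    pool_unlock();
    return id;
}

__device__ void chunk_free(int id)
{
    pool_lock();
    int n = g_free_count;
    g_free_list[n] = id;
    g_free_count = n + 1;
    pool_unlock();
}

__global__ void pool_init()
{
    for (int i = 0; i < NUM_CHUNKS; ++i) g_free_list[i] = i;
    g_free_count = NUM_CHUNKS;
    g_lock = 0;
}

// One alloc per block: thread 0 grabs a chunk, the block fills it,
// thread 0 checks the contents and gives the chunk back.
__global__ void use_chunks(int *pool, int *ok_count)
{
    __shared__ int chunk;
    if (threadIdx.x == 0) chunk = chunk_alloc();
    __syncthreads();
    if (chunk < 0) return;                        // uniform across the block

    int *mem = pool + chunk * CHUNK_INTS;
    for (int i = threadIdx.x; i < CHUNK_INTS; i += blockDim.x) mem[i] = blockIdx.x;
    __syncthreads();

    if (threadIdx.x == 0) {
        bool ok = true;
        for (int i = 0; i < CHUNK_INTS; ++i) ok = ok && (mem[i] == (int)blockIdx.x);
        if (ok) atomicAdd(ok_count, 1);
        chunk_free(chunk);
    }
}

// Hypothetical host helper: launches NUM_CHUNKS blocks `launches` times and
// returns how many blocks got a chunk that they had to themselves.
int run_chunk_demo(int launches)
{
    int *d_pool = nullptr, *d_ok = nullptr, ok = 0;
    cudaMalloc(&d_pool, NUM_CHUNKS * CHUNK_INTS * sizeof(int));
    cudaMalloc(&d_ok, sizeof(int));
    cudaMemset(d_ok, 0, sizeof(int));
    pool_init<<<1, 1>>>();
    for (int i = 0; i < launches; ++i) use_chunks<<<NUM_CHUNKS, 64>>>(d_pool, d_ok);
    cudaMemcpy(&ok, d_ok, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_pool);
    cudaFree(d_ok);
    return ok;
}
```

With two launches of `NUM_CHUNKS` blocks each, the second launch only finds chunks because the first one freed them. Every alloc and free serializes on the one lock, which is why this only makes sense when allocations are rare, such as one per block.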
Thanks for the help.
Actually I can make it quite small, since the size is the height of a binary tree, so I’ll go for the fixed size.