I have declared a block of linear memory allocated by cudaMalloc() for use in the kernel, because the exact required allocation of the array is not known at declaration time in the host code, I have basically allocated a “safe” large number for the array.
After filling the array with the required data in the kernel, I would like to “resize” it to the amount it is actually using.
I know a resize operation is probably not possible, but I’m wondering if there are any cuda functions that allow to do this efficiently?
I need to keep the linear array in the same address in memory, so I’m absolutely trying to avoid creating a smaller C style array and assigning it to that.
Hmm…ok let me try rephrasing the problem. Basically the kernel function is doing the simple task below:
for certain conditions on idx (e.g. idx is even…)
output[idx] = input[idx]
Obviously this code results in holes in output. So what I am doing now is to __syncthreads() after the assigning is done among all threads,
then literally iterating through output and shifting values back if a null index/“hole” is encountered.
-There has got to be a better way of doing this! Would anyone be able to help me with this?
-Also, I think the above method would fail if computation is done across several blocks, since syncthreads syncs threads only within a block. So how would I sync threads across all blocks to make sure I don’t rearrange the output until all threads are completed?
-In the example above, the output[idx] = input[idx] obviously is done over every thread, but the code for filling in the holes does not need to be done for every thread, it just needs one execution after the output is fully written. What is the right thing to do in this event?
Thanks in advance!
Edit:: From reading around, I feel this problem has some connection to atomic functions, but I’m not sure how to use atomics since I can’t rely on block shared memory here, as the input array may be extremely large and spread over many many blocks.
Perhaps you could fill an array with 1’s and 0’s based on whether the value is an item or a “hole”. Then using a parallel prefix sum, you could calculate for each item where it’s packed location would be. Then the packing can happen in parallel.
That’s a great technique, thank you!