So I have an algorithm with the following characteristics:

A is an input array that is blocked into N chunks, giving a chunk-size of S = sizeof(A)/N. Each chunk may be processed independently, but will generate somewhere between 0 and S^2 results, depending on the data distribution in A. These results have to be eventually packed together in order. So the results R[i] corresponding to chunk A[i] must immediately follow the results R[i-1].

My current approach is to allocate enough room for N * S^2 results upfront, and then do a single pass over the input array to generate both the results as well as a histogram. Another pass scans the histogram to determine output indices, and a final pass gathers the results together.

This incurs a huge space penalty as the chunk-size needs to be around 256 or 512 elements (ints) to maximize bandwidth. For the average case, this incurs a 256x or 512x space overhead, which limits the size of the dataset that can be operated on.

I’m currently trying to decide between doing two passes over the dataset to first determine the result size and allocate only enough space to hold all of the results. I was wondering if anyone may have had better luck using dynamically allocated chunks of memory?

EDIT: Another relevant point is that the computation to generate the results is compute bound, somewhere around 3-4x more than memory bound, so I am trying to avoid a doing it twice.