So I have an algorithm with the following characteristics:
A is an input array that is blocked into N chunks, giving a chunk size of S = sizeof(A)/N elements. Each chunk may be processed independently, but will generate somewhere between 0 and S^2 results, depending on the data distribution in A. These results eventually have to be packed together in order, so the results R[i] corresponding to chunk A[i] must immediately follow the results R[i-1].
My current approach is to allocate enough room for N * S^2 results upfront, and then do a single pass over the input array to generate both the results and a per-chunk histogram of result counts. Another pass scans the histogram to determine output indices, and a final pass gathers the results together.
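A minimal sketch of this layout (the kernel names and the toy result-emitting functions below are placeholders, not the actual code):

```cuda
#include <cuda_runtime.h>

// Placeholders standing in for the real compute-bound step: pretend each input
// element x yields (x & 3) results, each equal to x + j.
__device__ int num_results(int x)        { return x & 3; }
__device__ int make_result(int x, int j) { return x + j; }

// Pass 1: each block processes one chunk, writing into its private S*S scratch
// region and recording how many results it produced (the "histogram" entry).
__global__ void generate(const int *A, int *scratch, int *counts, int S)
{
    int chunk = blockIdx.x;
    const int *in  = A       + (size_t)chunk * S;
    int       *out = scratch + (size_t)chunk * S * S;   // worst case: S^2 results

    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    for (int i = threadIdx.x; i < S; i += blockDim.x) {
        int x = in[i];
        int n = num_results(x);                // 0..S results per element
        for (int j = 0; j < n; ++j) {
            int slot = atomicAdd(&count, 1);   // reserve a slot in this chunk's region
            out[slot] = make_result(x, j);     // (ordering within a chunk is ignored here)
        }
    }
    __syncthreads();
    if (threadIdx.x == 0) counts[chunk] = count;
}

// Pass 3: after an exclusive scan of counts[] into offsets[] (pass 2), pack the
// results so that chunk i's output immediately follows chunk i-1's.
__global__ void gather(const int *scratch, const int *counts,
                       const int *offsets, int *R, int S)
{
    int chunk = blockIdx.x;
    const int *src = scratch + (size_t)chunk * S * S;
    int       *dst = R + offsets[chunk];
    for (int i = threadIdx.x; i < counts[chunk]; i += blockDim.x)
        dst[i] = src[i];
}
```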
This incurs a huge space penalty, as the chunk size needs to be around 256 or 512 elements (ints) to maximize bandwidth. In the average case that works out to roughly a 256x or 512x space overhead, which limits the size of the dataset that can be operated on.
I’m currently considering doing two passes over the dataset, the first just to determine the result size so that only enough space to hold all of the results is allocated. Has anyone had better luck using dynamically allocated chunks of memory?
EDIT: Another relevant point is that the computation to generate the results is compute bound, somewhere around 3-4x more than memory bound, so I am trying to avoid doing it twice.
Can’t you pack each chunk’s results immediately after processing them? Assuming each chunk is processed by one CUDA block, you then only need space for N_blocks * S^2 results at a time (where N_blocks is the number of concurrently executing blocks). The blocks would have to synchronize between result generation and compression/writeout in order to exchange the actual sizes of their generated output.
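A rough host-side sketch of this batched scheme (purely illustrative: generate() and gather() are assumed to behave like the kernels sketched above, and the per-batch scan is done on the host for simplicity):

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Assumed kernels, as in the earlier sketch.
__global__ void generate(const int *A, int *scratch, int *counts, int S);
__global__ void gather(const int *scratch, const int *counts,
                       const int *offsets, int *R, int S);

// Only NBLOCKS chunks' worth of worst-case scratch (NBLOCKS * S * S ints) is
// ever live at once.
void run_batched(const int *d_A, int *d_R, int N, int S, int NBLOCKS,
                 int *d_scratch /* NBLOCKS*S*S ints */, int *d_counts /* NBLOCKS ints */)
{
    std::vector<int> h_counts(NBLOCKS), h_offsets(NBLOCKS);
    int *d_offsets;
    cudaMalloc(&d_offsets, NBLOCKS * sizeof(int));

    int out_base = 0;                                  // running position in the packed output
    for (int first = 0; first < N; first += NBLOCKS) {
        int batch = std::min(NBLOCKS, N - first);

        generate<<<batch, 256>>>(d_A + (size_t)first * S, d_scratch, d_counts, S);
        cudaMemcpy(h_counts.data(), d_counts, batch * sizeof(int),
                   cudaMemcpyDeviceToHost);            // also synchronizes with generate()

        // Exclusive scan of this batch's counts, biased by the running base, so
        // each chunk's results land immediately after the previous chunk's.
        for (int i = 0; i < batch; ++i) {
            h_offsets[i] = out_base;
            out_base += h_counts[i];
        }

        cudaMemcpy(d_offsets, h_offsets.data(), batch * sizeof(int),
                   cudaMemcpyHostToDevice);
        gather<<<batch, 256>>>(d_scratch, d_counts, d_offsets, d_R, S);
    }
    cudaFree(d_offsets);
}
```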
This is a possible solution. I’ll try playing around with this to see if the number of blocks needed to get a decent throughput can be made large enough to outweigh the synchronization overheads. With about 100MB of scratch space, and a block size of 256 ints, it should be possible to handle about 512 blocks at a time, which seems promising.
As a follow-up: Tera’s suggestion resulted in a bookkeeping kernel going from 0.931896% to 9.77865% of the total runtime, as it must now be called every iteration. However, this seems worth it to me for being able to handle very large datasets.
So the kernel takes 10x as long now? Then your own suggestion of two passes would be better. Guess I misunderstood you somehow.
You could of course reduce the synchronization overhead by having two buffers per thread and processing a second chunk while waiting on the sizes for the first one.
Probably not worth doing for less than 10% overall gain though.
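A possible shape for that double-buffered variant (again just a sketch, reusing the hypothetical generate()/gather() kernels and buffer layout from above; the h_counts/h_offsets arrays are assumed to be pinned):

```cuda
#include <algorithm>
#include <cuda_runtime.h>

__global__ void generate(const int *A, int *scratch, int *counts, int S);
__global__ void gather(const int *scratch, const int *counts,
                       const int *offsets, int *R, int S);

// Two scratch/count buffers and two streams, so batch b+1's generate() overlaps
// the count readback, host scan, and gather() of batch b.
void run_double_buffered(const int *d_A, int *d_R, int N, int S, int NBLOCKS,
                         int *d_scratch[2], int *d_counts[2], int *d_offsets[2],
                         int *h_counts[2], int *h_offsets[2])
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    int out_base = 0, prev_batch = 0, prev_buf = 0;

    // Scan + gather for a batch whose counts have been copied back on stream[p].
    auto finish = [&](int p, int batch) {
        cudaStreamSynchronize(stream[p]);              // counts are now valid on the host
        for (int i = 0; i < batch; ++i) { h_offsets[p][i] = out_base; out_base += h_counts[p][i]; }
        cudaMemcpyAsync(d_offsets[p], h_offsets[p], batch * sizeof(int),
                        cudaMemcpyHostToDevice, stream[p]);
        gather<<<batch, 256, 0, stream[p]>>>(d_scratch[p], d_counts[p], d_offsets[p], d_R, S);
    };

    for (int first = 0, b = 0; first < N; first += NBLOCKS, b ^= 1) {
        int batch = std::min(NBLOCKS, N - first);
        generate<<<batch, 256, 0, stream[b]>>>(d_A + (size_t)first * S,
                                               d_scratch[b], d_counts[b], S);
        cudaMemcpyAsync(h_counts[b], d_counts[b], batch * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[b]);

        if (prev_batch > 0) finish(prev_buf, prev_batch);   // overlaps with this generate()
        prev_batch = batch; prev_buf = b;
    }
    if (prev_batch > 0) finish(prev_buf, prev_batch);       // drain the final batch

    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```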
Yes, the bookkeeping kernel now contributes more significantly to the execution time of the complete application. However, it is still a relatively small portion of the total runtime. There are several other kernels that make up one iteration of the computation (the multiple kernels are needed for the global barriers).
The gather kernel takes close to 15% of the runtime, an initial partitioning kernel about 3%, a scan kernel about 0.2%, and the main kernel the remaining ~73%. The algorithm itself is memory bound (although the access pattern is generally not memory friendly), and right now I am at about 30-40% of peak memory bandwidth with all global memory transfers included and relatively readable (not unrolled) code.
I am very happy with this, as my CPU implementation makes less efficient use of memory even with a sequential algorithm (it seems to be hampered by a short load-to-use-to-branch distance that destroys ILP/MLP).
I think that with further tuning I could get close to 60-70% efficiency by making the code more complicated (less readable): unrolling loops, merging related computations, and pushing more work into the initial partitioning kernel (which makes the subsequent operations more regular and further limits the working set size).
After this I’ll consider double buffering as you suggest if the per-iteration overhead becomes more significant, but I want to wait to see if this is counteracted by improving the partitioning kernel.
Synchronization between a CPU thread and a GPU kernel via the zero-copy mechanism is “not” encouraged by NVIDIA (as per Tim), although __threadfence_system() should really help (Fermi cards).
If that is acceptable, the GPU blocks can write to a host-memory location, which is intercepted by a CPU thread that allocates memory and passes it back to the block, which then writes out its results and completes its execution.
But obviously the final result is fragmented then.
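A bare-bones illustration of the zero-copy signalling mechanism being referred to (mapped pinned memory plus __threadfence_system(); this is not the full allocate-on-demand protocol, and as noted above it is not an encouraged pattern):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One block writes a payload (e.g. its result count) to mapped host memory,
// fences, and then raises a flag that the CPU thread is polling on.
__global__ void signal_host(volatile int *flag, volatile int *payload)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *payload = 42;
        __threadfence_system();        // make the payload visible to the host first
        *flag = 1;                     // then raise the flag
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_flag, *h_payload, *d_flag, *d_payload;
    cudaHostAlloc(&h_flag,    sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&h_payload, sizeof(int), cudaHostAllocMapped);
    *h_flag = 0; *h_payload = 0;
    cudaHostGetDevicePointer(&d_flag,    h_flag,    0);
    cudaHostGetDevicePointer(&d_payload, h_payload, 0);

    signal_host<<<1, 32>>>(d_flag, d_payload);

    // CPU-side "interception": spin until the kernel raises the flag.
    while (*((volatile int *)h_flag) == 0) { /* busy-wait */ }
    printf("payload = %d\n", *h_payload);

    cudaDeviceSynchronize();
    cudaFreeHost(h_flag);
    cudaFreeHost(h_payload);
    return 0;
}
```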
I was originally considering doing something along the lines of preallocating a pool of memory pages and then having CTAs allocate a new page via an atomic increment. They would then write their results to this page until it is full and then try to allocate another page. It would also be relatively easy to detect overflow by having each CTA write out a flag if the atomically incremented counter passes a threshold, but this would really be an out-of-memory error, as I don’t care to support out-of-core operations over PCIe yet.
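A rough sketch of that page-pool idea (all names and sizes below are hypothetical):

```cuda
#define PAGE_INTS          4096   // results per page; large enough to amortize the atomic
#define MAX_PAGES_PER_CTA  64     // bound on the per-CTA page list kept for the gather

struct PagePool {
    int *storage;     // num_pages * PAGE_INTS ints, preallocated up front
    int *next_page;   // single global counter, bumped with atomicAdd
    int  num_pages;
    int *overflow;    // set to 1 if the pool runs out (treated as out-of-memory)
};

// Intended to be called by one thread per CTA (e.g. thread 0, with page_list and
// pages_used in shared memory) whenever the current page fills up. Returns a
// pointer to a fresh page, or nullptr if the pool is exhausted.
__device__ int *alloc_page(PagePool pool, int *page_list, int *pages_used)
{
    int page = atomicAdd(pool.next_page, 1);
    if (page >= pool.num_pages || *pages_used >= MAX_PAGES_PER_CTA) {
        *pool.overflow = 1;                    // flag the error rather than spill over PCIe
        return nullptr;
    }
    page_list[(*pages_used)++] = page;         // record the page for the gather pass
    return pool.storage + (size_t)page * PAGE_INTS;
}
```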
Eventually the gather kernel could be used to reorder the results by walking the list of pages for each CTA. If the pages were made large enough, I think the performance overhead of this could be made negligible.
This might be fun to play around with later, but the compaction solution works well enough for me now.