Most efficient way to transform X elements to Y elements

Since this problem has probably been encountered before, I have to ask: what is the most efficient way to accomplish this algorithm? Here are the basics of my requirements:

  1. I start with a linear array of X elements. We’ll call this the “input” array. Each element is exactly 128 bytes, and I usually start with around 50,000 of them.
  2. Each element is processed and in turn converted into 0, 2, or 4 new elements of exactly the same size and type (128 bytes each). The processing of each element is completely independent of the others. The new elements are written to what I will call the “output” array. The order in which the elements are written is irrelevant.
  3. After processing all X elements, the output array is used as the input array and the process is repeated as necessary until 0 elements are left.

The main problem is how to write to the next available location in the output array when I process each element. The most workable solution I can think of is to use one 32-bit integer counter in global memory for the next available location in the output buffer, and have each thread use atomicAdd on it to reserve the next 2 or 4 elements before it writes them to the buffer (something like int x = atomicAdd(outCounter, newElements); out[x] = …, out[x+1] = …). This would result in every single thread issuing an atomicAdd to the exact same address once per element. It is difficult to find any concrete numbers on how bad this would be without trying it (I haven’t implemented it yet because I have an 8800 GTX and am waiting for my GTX 285 to arrive in the mail :) but from what I can find on these forums the general consensus is that this would probably be undesirable.

Another solution would be to simply have each thread write to a fixed stride of 4 elements. This wastes progressively more memory capacity and bandwidth the more often 0 or 2 elements are output instead of 4, but has the advantage of requiring no synchronization at all between threads. The disadvantage is that when iterating over the list again, threads will waste time on the many null elements, and, as mentioned, a lot of storage space in the output buffer is wasted. This is a serious problem because the buffer needs to store at least 2 million real elements, or 256MB, which already means a 512MB allocation for the input and output buffers together (the algorithm alternates which one is used as input and which as output). Assuming 0, 2, and 4 output elements occur with equal average frequency, padding to 4 gives only 50% efficiency for memory storage, memory bandwidth, and processing power, so each buffer would have to grow to 512MB, or 1GB total for both, which is unfeasible on the vast majority of CUDA cards, including mine.

The last theoretical solution I can think of is for each thread block to maintain its own smaller input and output buffers and use atomicAdd on shared memory to maintain the output element counter. The atomicAdds would be faster in this case (as is my understanding, anyway), but this only works assuming a perfectly even distribution of output elements between thread blocks. For instance, every thread in thread block 0 could output 4 elements while every thread in thread block 1 outputs 0. This again would require more memory than I can possibly allocate on the card, since the 2-million-element capacity estimate is the total number of REAL elements that need to be stored; the unused space in each thread block’s buffers would push the total capacity requirement above the number of elements actually output.

Performance is critical for me because this will be executed in realtime.

So I beseech the CUDA Gods! Anyone have any ideas on the best way to accomplish something like this?

atomicAdd will be insanely slow for this application.

What you need is a variation on the stream compaction algorithm. Search on that for more info, but here is the gist:

  1. Perform a first pass: determine the number of elements written by each thread and write that count out.
  2. Perform an additive (prefix-sum) scan on that array. The result is the index each thread should write to.
  3. Perform a second pass and write the actual output starting at the indices produced by the scan.

Or, another option is to do the following: the tradeoffs will depend on the extra memory bandwidth expense vs the extra computation expense of doing the calculation in 2 passes.

  1. Run your algorithm and write out 4 elements in each thread, as you said.
  2. Run a standard stream compaction to remove the null elements.

For the stream compaction algorithm, unfortunately, I don’t know how many elements a thread will output until it has done virtually all of its calculation on its input element, so the first pass would repeat almost all of the work and this method would be worse than 50% efficient.

For the latter, I would be compacting a 256MB stream with 2M elements on every iteration which would be extremely painful as well for performance. In addition to a lot of other operations I need to try to execute at least 800 iterations every second.

I thought of another solution that uses the smallest possible number of atomicAdds. Each thread puts its number of output elements (0, 2, or 4) into shared memory. With 64 elements being processed per block (one per 4 threads), I perform a quick reduction on that data which computes not only the total sum but also replaces each number with the cumulative sum of the preceding elements. I then use atomicAdd only once for the entire thread block (one thread out of the 256 executes the instruction) to reserve space in the output array for the total number of elements generated by that block’s 64 inputs. Each thread then writes at the base offset returned by the atomicAdd plus the relative offset produced by the reduction in shared memory. Do you think this is still too much?

I guess now might be a good time to ask: if you had 256 threads and an array x of 64 numbers that are each 0, 2, or 4, what would be the fastest way to transform the numbers such that x[i] = x[0] + x[1] + … + x[i] (an inclusive prefix sum)?

My intuition says probably, but this is getting close to the gray area. You would have to try it to find out.

I have had success in the past where I did a type of per block reduction to reduce the number of atomicAdds needed. But I had several thousand different memory locations where the atomicAdds were spread around and you have just one… Bah, it is impossible to tell without trying. At least as I see it, this one atomic add per block idea could have promise.

A scan in shared memory within the thread block.

Do you need the intermediate results of each iteration? It sounds like you do, since you call it an output array. If you don’t (if you’re somehow reducing all the values down to a single value or “best match” or something), maybe you don’t need to iterate the way you’re thinking. Instead of processing element X[0] and getting new elements Y and Y’ to insert and compute later, you could skip the output arrays entirely and use a shared memory stack! You chug on X[0] and get Y and Y’, push them on your stack, then pop off a new work item (Y’), transform THAT, and potentially push Z and Z’, or maybe nothing. Repeat until your stack is empty.

Such an algorithm may be completely impossible for your app, but it may be quite useful for others. For example, when I made a strong-probable-prime search algorithm, a successful test would create new tests to validate, and at the end I was basically looking for the longest chain of tests, and the stack method worked fine. I had a “work array” of 2^32 values but my final output was literally one integer, the “best” of the work values.

Tell us more about your higher level application and maybe more approaches would be possible.

I didn’t mention it because it isn’t relevant to the problem, but every time an element produces 0 output elements I do other processing with that element’s data, so this isn’t just some massive reduction computation. What I am actually doing, by running through the elements in one buffer, outputting to the other, and then switching, is implementing a flattened form of what would otherwise be a recursive algorithm.

Since inquiring minds will want to know anyway: the elements I am talking about are polygon quads, each consisting of 4 32-byte vertices (128 bytes per quad) in view space. Processing an input quad involves sampling a displacement map texture at each vertex and displacing each vertex by its displacement in screen space, then doing the projective divide, measuring the quad’s screen-space size along the u and v axes (in the geometry sense, not the texture-coordinate sense), and checking the u and v sizes against a threshold used to determine whether to split the quad along the u axis, the v axis, or both. If it is smaller than the threshold along both axes, I fill the quad into a framebuffer containing the depth and the u and v coordinates (in the texture sense this time). The effect is to take a low-resolution quad mesh and a displacement map and effectively tessellate it down to quads that are roughly 4 pixels in size, no more, no less, rendering them as I go. Normally this would be implemented with recursion, but obviously that is a no-no.

Keen observers will note this is basically a somewhat simplified implementation of REYES sans recursion, which I believe is mostly responsible for the slowness that has relegated this style of rendering to offline use in the past (that, and the lack of something like CUDA!). I have implemented this algorithm in C using OpenMP for CPU parallelization and achieved reasonably usable frame rates (15 to 20 FPS) at 640x480, rendering several meshes at 1 polygon per pixel. The idea is that this kind of rendering is a form of “perfect” LOD: if overdraw were minimized, my frame rate would depend only on my screen resolution and be completely independent of scene geometry complexity, even for extremely complex scenes whose input meshes, fully tessellated and displaced, would produce billions or even trillions of polygons. Since my memory bandwidth and floating-point power are higher on the GPU, I am optimistic about achieving very realtime frame rates using CUDA, especially considering the massive parallelization potential here (with enough cores, each of the operations in each of the 5 iterations could be executed completely in parallel). The final result will be paired with a deferred shading and lighting algorithm to complete the image. Examples of the kind of geometric detail achievable through displacement mapping and low-resolution meshes can be found by looking up ZBrush models.

Though it isn’t a very good example because my CPU version never implemented any kind of lighting or quality shading, these 2 screenshots show cylinders of only 32 polygons each, only 16 of which face the camera, so 32 rendered polygons in total.

Nobody has any better ideas than this? Surely having all thread blocks build one dynamically sized data structure is something people have tackled before.