A common question in the forums is how to deal with threads that produce variable amounts of output. Handling this efficiently has always been tricky because most solutions break one of two guidelines for good CUDA performance: maximize the fraction of global memory transactions that are coalesced and minimize inter-thread communication. Various solutions include:
1. Allocate enough global memory for each thread to hold the maximum possible number of output elements and then perform some kind of compaction step in a second kernel. This avoids inter-thread communication, but the writes are still hard to coalesce in some cases.
2. Dump the variable output from each block temporarily into shared memory and coordinate writing it to global memory in a coalesced way. This pushes the inter-thread communication to where it is fastest and also ensures coalesced writes of the final results. It is difficult for some problems, though, especially if you need shared memory for something else.
3. Build an output queue: allocate an array with enough slots for the maximum output from the kernel and initialize an unsigned integer in device memory to zero. The integer is the index of the next available output slot. When a thread needs to write an output element, it atomically adds 1 to the slot counter and writes its output to the slot number returned by atomicAdd(), which is the value of the counter before the addition. (A minimal sketch of this approach follows the list.) This approach is doubly bad, especially if the probability of output from each thread is high: it uses atomics for frequent inter-thread communication, and the writes are almost guaranteed to never be coalesced.
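For concreteness, here is a minimal sketch of what approach #3 looks like as a kernel. This is illustrative code, not the linked queue.cu; the kernel name is made up, and the odd-value predicate just matches the benchmark below.

// Sketch of approach #3: every thread that produces an element claims a slot
// in the global queue with atomicAdd() and writes to wherever the counter
// happened to be, so the writes are scattered (uncoalesced) by construction.
__global__ void output_queue_kernel(const int *input, int n,
                                    int *queue, unsigned int *queue_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    int value = input[i];
    if (value & 1) {  // example predicate: keep the odd values
        // atomicAdd() returns the counter value *before* the increment,
        // i.e. the index of the slot this thread just claimed.
        unsigned int slot = atomicAdd(queue_count, 1u);
        queue[slot] = value;
    }
}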
(Edit: Fixed link to right file.)
I wanted to see how approach #3 improves with the new L2 cache. The test code takes an array of 16M consecutive integers and writes the odd ones to the output queue ( http://bitbucket.org/seibert/fermi_test/src/tip/queue.cu , compile with -arch sm_11). The printed MB/sec is the total amount of data read and written (1.5 * 4 bytes * 16M: every element is read and half of them are written back) divided by the kernel run time. Here are the results from a GTX 275 and a GTX 470, which have device-to-device bandwidths within 10% of each other:
Device name: GeForce GTX 275
BogoGFLOPS: 699.8
Size of array: 16776960 elements
8388480 items put into output queue.
Output Queue: 1929.249 ms, 49.8 MB/sec
Device name: GeForce GTX 470
BogoGFLOPS: 1088.6
Size of array: 16776960 elements
8388480 items put into output queue.
Output Queue: 188.675 ms, 808.9 MB/sec
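For reference, here is roughly how the printed MB/sec can be computed with CUDA events. This is only an illustration of the formula described above, not the code in queue.cu, and it reuses the hypothetical output_queue_kernel sketched earlier (input initialization and error checking omitted).

#include <cstdio>

int main()
{
    const int n = 16776960;                    // same element count as the test
    int *d_input, *d_queue;
    unsigned int *d_count;
    cudaMalloc(&d_input, n * sizeof(int));
    cudaMalloc(&d_queue, n * sizeof(int));     // worst case: every element is kept
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));
    // (fill d_input with consecutive integers here)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    output_queue_kernel<<<(n + 255) / 256, 256>>>(d_input, n, d_queue, d_count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 16M ints read plus 8M ints written -> 1.5 * 4 bytes * 16M of traffic
    double mbytes = 1.5 * sizeof(int) * (double) n / (1024.0 * 1024.0);
    printf("Output Queue: %.3f ms, %.1f MB/sec\n", ms, mbytes / (ms / 1000.0));
    return 0;
}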
That’s a 16x difference in processing rate, though the GTX 470 is still a factor of 110 below the device-to-device bandwidth measured for the card. If someone wants to implement the same test, but using stream compaction in global memory or some kind of output queue in shared memory, I’d be curious to see how far below optimal an output queue is for this scenario.
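In case it helps anyone who wants to try that, here is one possible shape for the shared-memory variant: stage each block's outputs in shared memory with a block-local counter, reserve one contiguous range of the global queue per block with a single atomicAdd(), and then copy the staged results out so the final writes are mostly contiguous. This is a rough, untested sketch under my own assumptions (256-thread blocks, made-up names), not a tuned implementation, and shared-memory atomics need sm_12 or higher.

// One possible shared-memory output queue (rough sketch, untested): assumes a
// block size of 256 threads so one staging slot per thread is sufficient.
__global__ void shared_queue_kernel(const int *input, int n,
                                    int *queue, unsigned int *queue_count)
{
    __shared__ int s_out[256];            // per-block staging buffer
    __shared__ unsigned int s_count;      // block-local output counter
    __shared__ unsigned int s_base;       // block's starting slot in the global queue

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)
        s_count = 0;
    __syncthreads();

    // Stage outputs in shared memory using a fast shared-memory atomic.
    if (i < n) {
        int value = input[i];
        if (value & 1) {                  // same odd-value predicate as before
            unsigned int pos = atomicAdd(&s_count, 1u);
            s_out[pos] = value;
        }
    }
    __syncthreads();

    // A single global atomic per block reserves a contiguous range of slots.
    if (threadIdx.x == 0)
        s_base = atomicAdd(queue_count, s_count);
    __syncthreads();

    // Consecutive threads write consecutive addresses, so these final stores
    // are largely coalesced.
    if (threadIdx.x < s_count)
        queue[s_base + threadIdx.x] = s_out[threadIdx.x];
}

The trade is one global atomic per block instead of one per output element, at the cost of a staging buffer and a couple of extra barriers; output order is not preserved, but it isn't with the global atomic version either.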
Argh, this is what I get for writing microbenchmarks on the couch. :) I linked to the source file for the wrong test. It is fixed now. The file is queue.cu.