Bug when using atomic operations in a fragment shader

GPU: Nvidia GeForce GTX 480
Driver: 340.52
Operating System: Windows 8.1

I wish I could condense the following. Unfortunately, the bug seems to stem from the complexity of the algorithm.

Background

I’m implementing a variant of order-independent transparency using per-pixel linked lists of fragments. The technique is based on [Lefebvre et al. 2014]. It proceeds in 3 passes:

  1. Clear: Buffers are cleared using a screen-aligned quad (a sketch of this pass follows the list).
  2. Linked-list creation: Fragments are added to the per-pixel linked lists by rendering the scene geometry.
  3. Blending: Linked-lists are traversed using a screen-aligned quad. Fragments from the lists are blended together to form the final image.
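
For context, here is a minimal sketch of what the clear pass might look like as a fragment shader. It is illustrative only: the declarations mirror the pass 2 shader below, and I assume the atomic counter is reset from the host each frame (e.g. via glClearBufferData).

#version 430
#extension GL_NV_gpu_shader5 : enable	// provides the sized uint32_t type

struct oit_data {
	uint32_t next, compressed_diffuse;
	float depth;
};

coherent restrict layout (std430) buffer data_buffer
{ oit_data data[]; };

uniform uvec2 window_dimensions;

void main()
{
	// Mark this pixel's list as empty by resetting its head node.
	uint32_t head = uint(gl_FragCoord.x) + uint(gl_FragCoord.y) * window_dimensions.x;
	data[head].next = 0;
}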

The tricky part is to keep the linked lists sorted according to depth (so that the fragments blend together correctly). To do so, I use insertion sort in pass 2. It proceeds as follows (for each fragment):

#version 430
#extension GL_NV_gpu_shader5 : enable	// provides the sized uint32_t type

struct oit_data {
	uint32_t next, compressed_diffuse;
	float depth;
};

coherent restrict layout (std430) buffer data_buffer
{ oit_data data[]; };

layout(binding = 0, offset = 0) uniform atomic_uint count;

uniform uvec2 window_dimensions;

// gl_FragCoord must remain a vec4 when redeclared; pixel_center_integer
// shifts its xy components onto integer values.
layout(pixel_center_integer) in vec4 gl_FragCoord;

uint32_t allocate() { return window_dimensions.x * window_dimensions.y + atomicCounterIncrement(count); }

#define MAX_ITER 2048

void main()
{
	uint32_t compressed_diffuse = /* just some diffuse data */;
	float depth = /* negated depth in eye (view) coordinates. E.g., in the range [z_near, z_far]. */;

	// Calculate indices
	uint32_t head = uint(gl_FragCoord.x) + uint(gl_FragCoord.y) * window_dimensions.x;
	uint32_t new = allocate();

	// Store fragment data in node
	data[new].compressed_diffuse = compressed_diffuse;
	data[new].depth = depth;

	// Start with the head node
	uint32_t previous = head;
	uint32_t current = data[head].next;

	// Insert the new node while maintaining a sorted list.
	// The algorithm finishes in a finite yet indeterminate number of steps.
	// Indeterminate, since some steps may be repeated due to concurrent updates.
	// Thus the total number of steps required for a single insertion
	// is not known beforehand. However, finiteness guarantees
	// that the algorithm terminates eventually. Still, the number of iterations
	// is capped due to hardware constraints.
	for (int i = 0; i < MAX_ITER; ++i) {
		// We are either at the end of the list or just before a node of greater depth...
		if (current == 0 || depth < data[current].depth) {
			// ...so we attempt to insert the new node here. First,
			// the new node is set to point to the current node. It is crucial
			// that this change happens now since the next step makes
			// the new node visible to other threads. That is, the new node must
			// be in a complete state before becoming visible.
			data[new].next = current;

			// Then the previous node is atomically updated to point to new node
			// if the previous node still points to the current node.
			// Returns the original content of data[previous].next (regardless of the
			// result of the comparison).
			uint32_t previous_next = atomicCompSwap(data[previous].next, current, new);

			// The atomic update occurred...
			if (previous_next == current)
				// ...so we are done.
				break;
			// Another thread updated data[previous].next before us...
			else
				// ...so we continue from previous_next
				current = previous_next;
		// We are still searching for a place to insert the new node...
		} else {
			// ...so we advance to the next node in the list.
			previous = current;
			current = data[current].next;
		}
	}

}

As you can see, I store all the linked-list nodes in a shader storage buffer object accessible through the data_buffer variable. Each node has a:

  • next: Index to the next node in the list. A value of 0 denotes the end of the list.
  • compressed_diffuse: The fragment's diffuse color which will be blended in pass 3.
  • depth: The fragment's depth used to sort the nodes.

The first (window_dimensions.x * window_dimensions.y) nodes of data_buffer are the head pointers of the linked lists. These special nodes are all cleared in pass 1, so that data[head].next is initially 0 for every pixel.
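As a concrete example (assuming the counter is reset to 0 in pass 1): with a 1920×1080 window, the fragment at pixel (10, 2) owns head node 10 + 2 * 1920 = 3850, while the first node returned by allocate() has index 1920 * 1080 + 0 = 2073600, just past the head region.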

Basically, each fragment first allocates a new node to store its data. This is done by incrementing an atomic counter and using the returned value, offset past the head region, as an index into data. Then the list is traversed starting at data[head].next. This process is outlined in the code comments.
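
For completeness, pass 3 then walks each (by now sorted) list over a screen-aligned quad and composites the fragments front to back. A minimal sketch with the same declarations as above; decompress_diffuse() is a placeholder for however compressed_diffuse is actually packed (here I simply assume unpackUnorm4x8):

// Pass 3 (sketch): blend the sorted fragments of this pixel's list.
out vec4 out_color;

// Placeholder unpacking; the real format of compressed_diffuse may differ.
vec4 decompress_diffuse(uint32_t c) { return unpackUnorm4x8(c); }

void main()
{
	uint32_t head = uint(gl_FragCoord.x) + uint(gl_FragCoord.y) * window_dimensions.x;

	// Front-to-back "under" compositing, capped like the insertion loop.
	vec4 color = vec4(0.0);
	uint32_t current = data[head].next;
	for (int i = 0; i < MAX_ITER && current != 0; ++i) {
		vec4 f = decompress_diffuse(data[current].compressed_diffuse);
		color.rgb += (1.0 - color.a) * f.a * f.rgb;
		color.a   += (1.0 - color.a) * f.a;
		current = data[current].next;
	}
	out_color = color;
}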

Bug

The above code seemingly works, but every frame a few random pixels keep flickering. This indicates that the linked lists of those pixels have been corrupted.

I’ve tried inserting memoryBarrier() after every statement to ensure that all writes are visible. This does not help.

However, by using a redundant atomicAdd operation, I can get the algorithm to work:

// We are still searching for a place to insert the new node...
		} else {
			// ...so we advance to the next node in the list.
			previous = current;
			current = atomicAdd(data[current].next, 0); // Atomic read
		}

I can’t fully explain why: a plain read without the atomicAdd should have worked just as well. I hope someone can offer some insight into this.
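
My best guess is that, without the atomic, the compiler assumes data[current].next cannot change between two reads and keeps the value in a register, whereas atomicAdd forces an actual memory access. If that is the cause, declaring the buffer volatile in addition to coherent might achieve the same effect more cheaply, though I have not verified this:

// Untested variation: volatile tells the compiler that other invocations
// may modify the buffer at any time, so every read must go to memory.
volatile coherent restrict layout (std430) buffer data_buffer
{ oit_data data[]; };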

Interestingly, [Lefebvre et al. 2014] also report strange behaviour with atomics. They sometimes had to insert redundant memoryBarrier() calls to make their code samples work. I quote from a comment in their code samples:

“On Kepler and Fermi architectures a race-condition occurs sometimes
in loops with atomic operations. For this reason, a number of tricks
had to be used to escape the race condition. All of these incur
performance penalty, but will hopefully become unncessary in the near
future. The tricks vary between Kepler and Fermi.
Some loops are written in an unatural manner to circumvent these issues.
Similarly, the iteration counters in loops are only used due to race
condition issues (‘iter’ variables).”

Similarly, they have this interesting macro, which they sprinkle throughout the code.

#ifdef Kepler
#define BARRIER memoryBarrier()
#else
#define BARRIER // barriers seem unnecessary on our GTX480 ...
                // in case of trouble put them back in place.
#endif

References

  • Lefebvre, Sylvain, Samuel Hornus, and Anass Lasram. “Per-Pixel Lists for Single Pass A-Buffer.” CRC Press (2014), 3–23. http://www.crcnetbase.com/doi/abs/10.1201/b16721-3