Shared memory intra-warp conflicts: summing into shared memory, how?

Is there any way to emulate atomic memory access in shared memory for floats?

What I’d like to do is sum numbers into shared memory. Every thread determines the position it wants to write to separately, which means that sometimes two (or more) threads try to write to the same location, and only one of those writes actually lands. I tried the following:

[codebox]__shared__ float array_ptr[1024];
volatile float *some_position = array_ptr + some_offset; // that's the array we're trying to sum into
float new_value = *some_position + increment;
*some_position = new_value;
// retry until the value read back is the one this thread wrote
while (*some_position != new_value) {
    new_value = *some_position + increment;
    *some_position = new_value;
}[/codebox]

The idea is that (assuming the increments are different) every thread loops until its own increment has been added to the sum.

(This didn’t really work though.)

When you write something and read it back, does the read return the final value in shared memory, or could that value still be changed later by another thread's write? Do the reads occur after all the writes from the warp? Or… how do you avoid conflicts like that?

Comments welcome :)

There are indeed many solutions. The fastest one may depend on the write behavior, though. Perhaps it’s rare for two threads to ever collide; then a good solution is to make a vote array of 32 slots. Each thread writes its ID into the slot it wants to update, and if it “wins” the slot write (it reads its own ID back), it gets to update the float. That sucks if all 32 threads want to write to the same array value though… you need 32 passes, and each one of those passes will have many-way bank conflicts (= slow). This method also works OK for sparse writes where only 2 or 3 threads have anything to write anyway.
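To make the vote idea concrete, here is a minimal sketch for a single warp. Everything named here (`warp_vote_sum`, `vote`, `sums`, `run_vote_demo`) is made up for illustration, and hashing the target into the 32 vote slots with `target & 31` is my own assumption about how to keep the vote array small. The `__syncwarp()` and `__any_sync()` calls assume a CUDA 9 or newer toolkit; older hardware relied on implicit warp-synchronous execution instead.

```cuda
#include <cuda_runtime.h>

// Sketch only: one warp of 32 threads, each adding one increment to one slot.
// Assumes n_slots <= 1024 and a launch of exactly <<<1, 32>>>.
__global__ void warp_vote_sum(const int *targets, const float *increments,
                              float *out, int n_slots)
{
    __shared__ float sums[1024];
    __shared__ volatile int vote[32];

    const int lane = threadIdx.x & 31;
    for (int i = lane; i < n_slots; i += 32) sums[i] = 0.0f;
    __syncwarp();

    const int target = targets[lane];
    const float inc = increments[lane];
    const int slot = target & 31;   // assumed hash: target -> vote slot
    bool pending = true;

    // Repeat until every lane has won a vote once.
    while (__any_sync(0xffffffffu, pending)) {
        if (pending) vote[slot] = lane;      // colliding writes: one survives
        __syncwarp();
        if (pending && vote[slot] == lane) { // this lane won the slot
            sums[target] += inc;
            pending = false;
        }
        __syncwarp();                        // finish the pass before revoting
    }

    for (int i = lane; i < n_slots; i += 32) out[i] = sums[i];
}

// Hypothetical host helper: all 32 lanes add 1.0f to slot 5 (the worst case).
float run_vote_demo()
{
    int h_targets[32];
    float h_inc[32], h_out[64];
    for (int i = 0; i < 32; ++i) { h_targets[i] = 5; h_inc[i] = 1.0f; }

    int *d_targets; float *d_inc, *d_out;
    cudaMalloc(&d_targets, sizeof(h_targets));
    cudaMalloc(&d_inc, sizeof(h_inc));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_targets, h_targets, sizeof(h_targets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_inc, h_inc, sizeof(h_inc), cudaMemcpyHostToDevice);

    warp_vote_sum<<<1, 32>>>(d_targets, d_inc, d_out, 64);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_targets); cudaFree(d_inc); cudaFree(d_out);
    return h_out[5];
}
```

Two lanes that win in the same pass always hold different vote slots and therefore different targets, so their updates cannot collide. In the demo every lane fights over the same slot, so the loop needs the full 32 passes and `run_vote_demo()` should come back as 32.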

Another way is to do it round-robin. This is good if there’s lots of data to write and there may be lots of conflicts. It’s even better if each thread has multiple values it wants to write for multiple slots because those multiple writes are free.

The basic idea is to do something like

// assume wid = thread ID within the warp, from 0 to 31
for (int delta = 0; delta < 32; ++delta)
    Do_Any_Ops_I_Want_To_Slot(31 & (wid + delta));

This does require 32 loop iterations and therefore has overhead, but never has collisions since every thread gets to visit every slot without interference.
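A rough sketch of the round-robin loop for one warp might look like the following. The names (`warp_round_robin_sum`, `sums`, `run_round_robin_demo`) are illustrative, and treating "slot" as the residue class `target & 31` of a larger array is my assumption, not something stated above. `__syncwarp()` assumes CUDA 9 or newer.

```cuda
#include <cuda_runtime.h>

// Sketch only: one warp of 32 threads; assumes n_slots <= 1024 and <<<1, 32>>>.
__global__ void warp_round_robin_sum(const int *targets, const float *increments,
                                     float *out, int n_slots)
{
    __shared__ float sums[1024];

    const int wid = threadIdx.x & 31;
    for (int i = wid; i < n_slots; i += 32) sums[i] = 0.0f;
    __syncwarp();

    const int target = targets[wid];
    const float inc = increments[wid];

    for (int delta = 0; delta < 32; ++delta) {
        // At this step, thread wid exclusively owns every array index whose
        // low 5 bits equal `owned`; no other thread owns the same residue.
        const int owned = 31 & (wid + delta);
        if ((target & 31) == owned) sums[target] += inc;
        __syncwarp();   // keep the steps from overlapping
    }

    for (int i = wid; i < n_slots; i += 32) out[i] = sums[i];
}

// Hypothetical host helper: lane i adds 1.0f to slot i % 4.
float run_round_robin_demo()
{
    int h_targets[32];
    float h_inc[32], h_out[64];
    for (int i = 0; i < 32; ++i) { h_targets[i] = i % 4; h_inc[i] = 1.0f; }

    int *d_targets; float *d_inc, *d_out;
    cudaMalloc(&d_targets, sizeof(h_targets));
    cudaMalloc(&d_inc, sizeof(h_inc));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_targets, h_targets, sizeof(h_targets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_inc, h_inc, sizeof(h_inc), cudaMemcpyHostToDevice);

    warp_round_robin_sum<<<1, 32>>>(d_targets, d_inc, d_out, 64);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_targets); cudaFree(d_inc); cudaFree(d_out);
    return h_out[2];
}
```

Each of the four slots used in the demo receives eight increments of 1.0f, so `run_round_robin_demo()` should return 8. Since all threads touch different residues in every step, the accesses also fall into different banks.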

Thanks for the ideas! By the way, I discovered that mine works too if I put a __threadfence() after the write attempt, so we’ve got three working alternatives now :)

Unfortunately, I realized that the problem wasn’t only the intra-warp conflicts but the race conditions between the warps of the block, trying to write to the same place… But that could be solved too with shared memory atomics and the same way of looping (can’t increment a float atomically but atomicCAS can be used to make sure the current thread’s incremented version got written).
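For reference, the atomicCAS loop can be sketched roughly like this. The function and kernel names (`atomic_add_float_cas`, `block_sum`, `run_block_sum_demo`) are illustrative, and the sketch assumes hardware that supports 32-bit atomics on shared memory. On devices of compute capability 2.0 and later, the built-in `atomicAdd` on `float` makes the loop unnecessary.

```cuda
#include <cuda_runtime.h>

// Emulated float atomic add: reinterpret the float as an int and retry the
// compare-and-swap until no other thread changed the value in between.
__device__ float atomic_add_float_cas(float *addr, float val)
{
    int *iaddr = reinterpret_cast<int *>(addr);
    int old = *iaddr;
    int assumed;
    do {
        assumed = old;
        const int desired = __float_as_int(__int_as_float(assumed) + val);
        old = atomicCAS(iaddr, assumed, desired);
    } while (old != assumed);   // another thread got in first: try again
    return __int_as_float(old);
}

// Sketch only: a whole block (several warps) summing into shared memory.
// Assumes n_slots <= 1024 and one target/increment per thread.
__global__ void block_sum(const int *targets, const float *increments,
                          float *out, int n_slots)
{
    __shared__ float sums[1024];
    for (int i = threadIdx.x; i < n_slots; i += blockDim.x) sums[i] = 0.0f;
    __syncthreads();

    atomic_add_float_cas(&sums[targets[threadIdx.x]], increments[threadIdx.x]);
    __syncthreads();

    for (int i = threadIdx.x; i < n_slots; i += blockDim.x) out[i] = sums[i];
}

// Hypothetical host helper: 256 threads, thread i adds 1.0f to slot i % 2.
float run_block_sum_demo()
{
    const int n = 256;
    int h_targets[n];
    float h_inc[n], h_out[2];
    for (int i = 0; i < n; ++i) { h_targets[i] = i % 2; h_inc[i] = 1.0f; }

    int *d_targets; float *d_inc, *d_out;
    cudaMalloc(&d_targets, sizeof(h_targets));
    cudaMalloc(&d_inc, sizeof(h_inc));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_targets, h_targets, sizeof(h_targets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_inc, h_inc, sizeof(h_inc), cudaMemcpyHostToDevice);

    block_sum<<<1, n>>>(d_targets, d_inc, d_out, 2);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_targets); cudaFree(d_inc); cudaFree(d_out);
    return h_out[0];
}
```

Unlike the read-back check in the original loop, the compare-and-swap tests and writes in one indivisible step, so it holds up both inside a warp and across the warps of a block. In the demo, 128 threads each add 1.0f to slot 0, so `run_block_sum_demo()` should return 128.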