shared memory intra-warp conflicts summing into shared memory, how?

safars · September 3, 2009, 8:46pm

Is there any way to emulate atomic memory access in shared memory for floats?

What I’d like to do is sum numbers into shared memory. Every thread determines the position it wants to write to on its own, which means that sometimes two (or more) threads try to write to the same location, and only one of those writes actually lands. I tried the following:

[codebox]__shared__ float array_ptr[1024];

volatile float *some_position = array_ptr + some_offset; // that’s the array we’re trying to sum into

float new_value = *some_position + increment;
*some_position = new_value;

while (*some_position != new_value) {
    new_value = *some_position + increment;
    *some_position = new_value;
}[/codebox]

so (assuming the increments are different) every thread would loop until its additional value gets added to the sum.

(This didn’t really work though.)

When you write something and read it back, is it read directly from the final value in shared memory, or could it be subject to later change? Do the reads occur after all the writes from the warp? Or… how do you avoid conflicts like that?

Comments welcome :)

SPWorley · September 3, 2009, 9:14pm

There are indeed many solutions. The fastest one may depend on the write behavior, though. Perhaps it’s rare for two threads to ever collide; then a good solution is to make a vote array of 32 slots: each thread writes the slot it wants to update, and if it “wins” the slot write, it gets to update the float. That sucks if all 32 threads want to write to the same array value though… you need 32 passes, and each one of those passes will have many-way bank conflicts (= slow). This method also works OK for sparse writes where only 2 or 3 threads have anything to write anyway.
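One possible reading of the vote-array idea, as a minimal sketch rather than the exact scheme described above: each pending lane writes its lane ID into a tag array indexed by the slot it wants, reads the tag back, and performs the float update only if its own ID survived. The names `voteSum`, `runVoteSum` and `kSlots` are hypothetical, the kernel assumes a single warp per block, and it uses `__syncwarp()` and `__any_sync()`, which come from CUDA 9 and later and postdate this thread.

```cuda
#include <cuda_runtime.h>

constexpr int kSlots = 32;  // hypothetical: one warp summing into 32 floats

// Vote-array sketch. Launch with exactly one warp: voteSum<<<1, 32>>>(...).
// Assumes every slot[i] lies in [0, kSlots).
__global__ void voteSum(const int* slot, const float* inc, float* out)
{
    __shared__ float sums[kSlots];
    __shared__ volatile int vote[kSlots];

    const int lane = threadIdx.x;
    sums[lane] = 0.0f;
    __syncwarp();

    const int mySlot = slot[lane];
    const float myInc = inc[lane];
    bool pending = true;

    // Repeat until every lane has added its value.
    while (__any_sync(0xffffffffu, pending)) {
        if (pending) vote[mySlot] = lane;       // on a collision, one write survives
        __syncwarp();
        if (pending && vote[mySlot] == lane) {  // this lane won the slot
            sums[mySlot] += myInc;
            pending = false;
        }
        __syncwarp();                           // finish reads before the next vote
    }
    out[lane] = sums[lane];
}

// Hypothetical host helper: copies 32 inputs over, runs one warp, copies sums back.
void runVoteSum(const int* hSlot, const float* hInc, float* hOut)
{
    int* dSlot; float* dInc; float* dOut;
    cudaMalloc(&dSlot, kSlots * sizeof(int));
    cudaMalloc(&dInc, kSlots * sizeof(float));
    cudaMalloc(&dOut, kSlots * sizeof(float));
    cudaMemcpy(dSlot, hSlot, kSlots * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dInc, hInc, kSlots * sizeof(float), cudaMemcpyHostToDevice);
    voteSum<<<1, 32>>>(dSlot, dInc, dOut);
    cudaMemcpy(hOut, dOut, kSlots * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSlot); cudaFree(dInc); cudaFree(dOut);
}
```

With no collisions the loop finishes in one pass; if all 32 lanes pick the same slot it takes 32 passes, which is the worst case mentioned above.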

Another way is to do it round-robin. This is good if there’s lots of data to write and there may be lots of conflicts. It’s even better if each thread has multiple values it wants to write for multiple slots because those multiple writes are free.

The basic idea is to do something like

// assume wid = thread ID from 0 to 31

for (int delta = 0; delta < 32; ++delta) Do_Any_Ops_I_Want_To_Slot(31 & (wid + delta));

This does require 32 loop iterations and therefore has overhead, but never has collisions since every thread gets to visit every slot without interference.
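The round-robin loop above could look roughly like this for the simple case where each lane has one value for one slot. This is a sketch under assumptions: `roundRobinSum` and `runRoundRobinSum` are made-up names, there are exactly 32 slots, the block is a single warp, and `__syncwarp()` (CUDA 9+) keeps the lanes in step between iterations.

```cuda
#include <cuda_runtime.h>

// Round-robin sketch. Launch with one warp: roundRobinSum<<<1, 32>>>(...).
// Assumes every slot[i] lies in [0, 32).
__global__ void roundRobinSum(const int* slot, const float* inc, float* out)
{
    __shared__ float sums[32];

    const int wid = threadIdx.x & 31;  // lane ID, 0 to 31
    sums[wid] = 0.0f;
    __syncwarp();

    const int mySlot = slot[wid];
    const float myInc = inc[wid];

    for (int delta = 0; delta < 32; ++delta) {
        // In each iteration the 32 lanes visit 32 distinct slots,
        // so no two lanes ever touch the same float at the same time.
        const int s = 31 & (wid + delta);
        if (s == mySlot) sums[s] += myInc;
        __syncwarp();  // iteration boundary: make the update visible
    }
    out[wid] = sums[wid];
}

// Hypothetical host helper: copies 32 inputs over, runs one warp, copies sums back.
void runRoundRobinSum(const int* hSlot, const float* hInc, float* hOut)
{
    int* dSlot; float* dInc; float* dOut;
    cudaMalloc(&dSlot, 32 * sizeof(int));
    cudaMalloc(&dInc, 32 * sizeof(float));
    cudaMalloc(&dOut, 32 * sizeof(float));
    cudaMemcpy(dSlot, hSlot, 32 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dInc, hInc, 32 * sizeof(float), cudaMemcpyHostToDevice);
    roundRobinSum<<<1, 32>>>(dSlot, dInc, dOut);
    cudaMemcpy(hOut, dOut, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSlot); cudaFree(dInc); cudaFree(dOut);
}
```

If a lane holds values for several slots, the `if` can become a lookup into that lane's own per-slot contributions, so the extra writes ride along in the same 32 iterations.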

safars · September 5, 2009, 9:10am

Thanks for the ideas! By the way, I discovered that mine works too if I put a __threadfence() after the write attempt, so we’ve got three working alternatives now :)

Unfortunately, I realized that the problem wasn’t only the intra-warp conflicts but the race conditions between the warps of the block, trying to write to the same place… But that could be solved too with shared memory atomics and the same way of looping (can’t increment a float atomically but atomicCAS can be used to make sure the current thread’s incremented version got written).
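For reference, the atomicCAS loop mentioned above is commonly written along these lines. This is a sketch, not necessarily the code used here: `atomicAddFloatCAS`, `casSum` and `runCasSum` are hypothetical names, and the 32-slot array is an assumption. On GPUs of compute capability 2.0 and later, the built-in `atomicAdd(float*, float)` does the same job directly.

```cuda
#include <cuda_runtime.h>

// Emulated float atomic add: retry until our incremented value is the one
// that atomicCAS actually stored. Returns the value seen before the add.
__device__ float atomicAddFloatCAS(float* addr, float val)
{
    unsigned int* p = reinterpret_cast<unsigned int*>(addr);
    unsigned int old = *p;
    unsigned int assumed;
    do {
        assumed = old;
        const float sum = __uint_as_float(assumed) + val;
        old = atomicCAS(p, assumed, __float_as_uint(sum));
    } while (old != assumed);  // someone else changed it first: try again
    return __uint_as_float(old);
}

// All warps of the block sum n values into 32 shared slots.
// Assumes every slot[i] lies in [0, 32) and blockDim.x >= 32.
__global__ void casSum(const int* slot, const float* inc, float* out, int n)
{
    __shared__ float sums[32];
    if (threadIdx.x < 32) sums[threadIdx.x] = 0.0f;
    __syncthreads();

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        atomicAddFloatCAS(&sums[slot[i]], inc[i]);
    __syncthreads();

    if (threadIdx.x < 32) out[threadIdx.x] = sums[threadIdx.x];
}

// Hypothetical host helper: runs one block of 128 threads over n inputs.
void runCasSum(const int* hSlot, const float* hInc, float* hOut, int n)
{
    int* dSlot; float* dInc; float* dOut;
    cudaMalloc(&dSlot, n * sizeof(int));
    cudaMalloc(&dInc, n * sizeof(float));
    cudaMalloc(&dOut, 32 * sizeof(float));
    cudaMemcpy(dSlot, hSlot, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dInc, hInc, n * sizeof(float), cudaMemcpyHostToDevice);
    casSum<<<1, 128>>>(dSlot, dInc, dOut, n);
    cudaMemcpy(hOut, dOut, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSlot); cudaFree(dInc); cudaFree(dOut);
}
```

Because the compare-and-swap works on the raw 32-bit pattern, this covers both the intra-warp collisions and the races between warps of the same block.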
