Shared memory bytewise memory write guarantees

Shared memory is organized into banks, with reads and writes done in 32-bit chunks.

On a per-word (32-bit) basis, CUDA guarantees that if multiple threads write to the same 32-bit word in shared memory, there is no promise as to which write will succeed, but there is a promise that at least one of them will.

Shared memory can also be addressed, read, and written per byte. This can create inefficient bank conflicts (as described in the CUDA programming guide), but it is supported.

My question is more about what promises CUDA gives us about write collisions on a per-byte basis.

Ignoring any inefficiencies about bank conflicts, does CUDA guarantee that if a thread writes to a byte location in shared memory, that write will be recorded even if other threads may be writing to nearby addresses?

I worry that there may be a case where one thread writes a value to x[0], and another thread (maybe in the same warp, maybe not) writes to x[1]. If shared memory is updated in 32-bit chunks, a byte-wise write may effectively be something like “read the 32-bit word, update the byte you’re changing, then write the 32-bit word back.” If so, a write to x[0] might not “take” if another write to, say, x[1] overlaps its evaluation.

The CUDA guide is silent on this; it sort of skirts around the whole byte-wise addressing issue, likely because the hardware deals in 32-bit words as its natural chunk size.

To put this question in context, I’m using 1024 bytes as an existence hash table. I initialize them all to 0. I stream through data and for each item, I compute a 10 bit hash value and set the corresponding byte in shared memory to 1 to know that at least one item with that hash was seen.

These writes are all done in parallel with no syncthreads or anything. Even if two values both wrote to the same array location at once, I don’t mind since I am just storing a “yes, somebody hit here” flag value. I am not doing any incrementing count, so I don’t need to worry about atomic access.
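For concreteness, the scheme described above looks something like the following sketch. The kernel name, the multiplicative hash, and the output copy are my stand-ins, not the original code:

```cuda
// Hedged sketch of a byte-wise existence hash table in shared memory.
__global__ void existence_hash(const unsigned int *data, int n,
                               unsigned char *hit_flags)
{
    __shared__ unsigned char x[1024];

    // Cooperatively zero the table.
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        x[i] = 0;
    __syncthreads();

    // Stream through the data; the multiplicative hash here is just a
    // placeholder for whatever produces the 10-bit value.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        unsigned int h = (data[i] * 2654435761u) >> 22;  // 10-bit hash
        x[h] = 1;  // the byte-wise flag write in question: no atomics, no sync
    }
    __syncthreads();

    // Copy the flags out for inspection.
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        hit_flags[i] = x[i];
}
```

The write is a constant, order-independent flag set, so the only question is whether the byte store itself can be lost.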

This strategy works perfectly if I write to 32-bit array values. When I change to byte-wise addressing, it also seems to work, but occasionally I get strange, unreproducible differences in output. They’re rare (1 run out of 50), and any simplified test case I make always works.

I am hypothesizing that this lack of byte-wise write guarantees may be the issue, but it’s hard to tell, and before I dig any more it’s useful to ask if anyone else has ideas.

Has anyone run into a similar situation, or have any more ideas? I hate using my precious shared memory to store a single bit of data per 32-bit word… I’d be a lot more comfortable knowing it’s safe to use 8-bit addressing.

Out of curiosity, could you please post the relevant code? I always wondered whether I could do that update (set a value/bit to one from different threads) without problems.



Setting a value, yes, if it’s a boolean one-way flip like x[index]=1. Setting a bit, no way, because that requires you to load the word, apply your bitmask, then write the word back, leaving plenty of time for races.
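If you did want to pack the flags one *bit* per entry, the load/mask/store would have to go through an atomic instead. A minimal sketch, assuming shared-memory atomics are available (they require compute capability 1.2 or higher); the names are mine:

```cuda
// Bit-per-entry variant: 1024 flags packed into 32 words, set via atomicOr.
__global__ void existence_bits(const unsigned int *hashes, int n,
                               unsigned int *out)
{
    __shared__ unsigned int bits[32];  // 1024 one-bit flags

    if (threadIdx.x < 32)
        bits[threadIdx.x] = 0;         // zero the table
    __syncthreads();

    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        unsigned int h = hashes[i] & 1023;        // 10-bit hash value
        atomicOr(&bits[h >> 5], 1u << (h & 31));  // atomic read-modify-write
    }
    __syncthreads();

    if (threadIdx.x < 32)
        out[threadIdx.x] = bits[threadIdx.x];
}
```

That buys an 8x (or 32x over word-per-flag) space saving at the cost of serializing colliding updates through the atomic unit.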

Pseudocode is easy:

Make a shared memory array. Initialize to 0.

Loop through your data. Each piece may come up with one or more index values you want to record.

if (i_want_to_write_this_hash) x[index]=1;

That’s about it. The important part is that the array is initialized to 0, you’re writing only a constant value (so order is irrelevant), and you’re especially NOT reading a value in the loop, even to mask in a bit or something.

Works great with no syncthreads for 32 bit arrays, and is guaranteed by CUDA’s docs.

8-bit writes… well, that’s what I’m asking. :-)

This is a really interesting question, so I hope we get a response from tmurray soon.