Concurrent shared memory read/write access

Hello everyone,

I'm writing a decoder that reads variable-length codes.

I have 8 threads in a block.

Each thread does some decoding work, then enters a read section where it needs its offset into the data array.

It reads the offset, does some data reading, and once the variable length of the codeword is known, it increments the offset by this length

and passes it to the next thread.

So basically thread N must wait for the offset from thread N-1. This is done with a waiting spin loop.

While this parallelization works flawlessly in the CPU implementation, I run into a shared memory problem in CUDA.

The offsets are organized in an array, one element per thread. The structure of the offset array and the way threads pass offsets ensure that there is NO race condition, so synchronization should not be required.

In the CUDA implementation I'm using shared memory to store the offset array:

extern __shared__ int BlockArray[];

volatile unsigned int * Offsets = (unsigned int *)BlockArray;

Then, while the offset for thread x is not ready, the thread waits in a spin loop:

while (Offsets[threadIdx.x] < BitOffset) {};

When a thread gets a valid offset, it leaves the spin loop, does some data reading, calculates the next offset, and stores it for the next thread:

BitOffset += CodewordLength;

Offsets[(threadIdx.x + 1) % blockDim.x] = BitOffset;
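Put together, the handoff in each thread looks roughly like this (a sketch; Read() is my bit reader and also advances BitOffset by the codeword length):

```cuda
// Consumer: spin until our slot holds a fresh offset, then take it.
while (Offsets[threadIdx.x] < BitOffset) {}
BitOffset = Offsets[threadIdx.x];

// Read the variable-length codeword; Read() advances BitOffset
// by the codeword length once it is known.
unsigned char Codeword = Read(DataBuffer, BitOffset);

// Producer: publish the advanced offset to the next thread's slot.
Offsets[(threadIdx.x + 1) % blockDim.x] = BitOffset;
```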

As I said, this works flawlessly on the CPU; on the GPU, however, I see some strange behavior:


Thread 0 reads offset 0 and stores the next offset, 13, for thread 1.

Thread 1 reads offset 13 and stores the next offset, 27, for thread 0.

But in its second iteration, thread 0 does not see that its offset in shared memory is now 27. It still sees the initial offset 0!


I have followed the CUDA Programming Guide, declared Offsets as volatile, and even used atomicCAS and atomicExch, but no luck.

It looks to me as if some caching is going on in the background. Once a thread has read a shared memory value, all subsequent reads from that location give the thread the old value. It never detects that this shared memory location has changed.

Isn't shared memory supposed to be uncached?
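To illustrate what I suspect (just my mental model, not actual compiler output): if Offsets were not declared volatile, the compiler could legally keep the loaded value in a register, and the spin loop would never re-read shared memory:

```cuda
// Hypothetical code the compiler could generate WITHOUT volatile:
unsigned int cached = Offsets[threadIdx.x]; // single load into a register
while (cached < BitOffset) {}               // spins forever, never re-reads
```

But since Offsets is declared volatile, every reference should compile into an actual shared memory load, so that alone can't be the whole story.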

Currently it behaves more like a stale snapshot of shared memory.

Also, I can't use __syncthreads(), as my kernel threads form a pipeline where each thread is expected to be in a different stage of processing. Using __syncthreads() (if even possible) would lead to stalling and kill the pipeline, as there would be about 3 million synchronizations per kernel call.


I know, 8 threads is a very low degree of parallelism, but there will be another level of parallelism at the grid level, without inter-block dependencies.


If thread N waits for the results of thread N-1 before doing its own computation, this is a sequential algorithm, not a parallel one.

Use __syncthreads() to synchronize the threads of a block. Do not try to invent spinlocks; quite likely they don't do what you think they do.

The portion of the code that fetches an offset and calculates the new one takes 5% of the time; the remaining 95% can safely run in parallel.
This is a pipeline. Only the offset calculations can't overlap. In fact, if those calculations don't come close to each other, you get nearly full speedup by the number of cores.
As I said, the CPU implementation already does this, and there are about 3 million "offset synchronizations" per function call.
I'm not trying to invent anything; spin loops turned out to be the best solution, since they don't stall the entire CPU pipeline or my decoding pipeline; only the waiting thread is blocked (as it should be). Spin loops also prevent a thread from being suspended, and suspending a single thread would stall the entire pipeline for around 10,000 CPU cycles.

I cannot get rid of the offset dependency, because the data coming from the hardware encoder is aligned that way.

I thought it was a shared memory problem, but I spoke to one of our CUDA guys at the university. He says that thread 1 is probably completely suspended by the scheduler while thread 0 is spinning, waiting for the offset from the suspended thread 1. This lasts for nearly 2 seconds, until the GPU driver is reset by the OS.
Maybe it really is a scheduler problem; I'm not sure now…

I've already tried volatile, shared and global memory, __threadfence(), __threadfence_block(), and __syncthreads(), with and without atomic reads/writes; the result is always the same: thread A doesn't see the changes made to shared memory by thread B.

Spinlocks on the CPU are fine, of course. What I wanted to say is: spinlocks on the GPU, implemented straightforwardly like on the CPU, just won't work!
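The classic failure mode looks something like this (illustrative sketch; "flag" is a made-up variable). Within a warp, the two sides of a divergent branch are serialized, and the hardware is free to keep executing the spinning side and never switch to the other:

```cuda
__shared__ volatile int flag; // assume initialized to 0 beforehand

if (threadIdx.x == 0) {
    while (flag == 0) {}  // thread 0 spins on the flag...
} else if (threadIdx.x == 1) {
    flag = 1;             // ...but thread 1 sits in the other branch of the
                          // same warp and may never get to execute this line,
                          // so the whole warp hangs
}
```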

Can you show us the relevant part of your __syncthreads()-based implementation? That’s the one that is supposed to work.

__global__ void DecoderKernel(DecodingObject DO)
{
	// Initializations...
	extern __shared__ int BlockArray[];

	int TID = threadIdx.x;
	int NextTID = (TID + 1) % DO.Threads;
	unsigned int BitOffset = TID == 0 ? 0 : 1;
	volatile unsigned int * Offsets = (unsigned int *)BlockArray;

	for (unsigned int i = 0; i < DO.Threads; i++)
		Offsets[i] = 0;

	while (true)
	{
		// Long decoding part...

		int ErrorCounter = 0;
		while (Offsets[TID] < BitOffset) // spin loop
		{
			__syncthreads(); // barrier inside the spin loop

			if (100000 < ++ErrorCounter)
			{ // give up and return the offsets visible from this thread
				*((unsigned int *)&DO.Output[8*TID])     = Offsets[0];
				*((unsigned int *)&DO.Output[8*TID + 4]) = Offsets[1];
				return;
			}
		}

		// Now the next offset is known, use it for reading DataBuffer
		BitOffset = Offsets[TID];

		// Reading variable-bit-length data
		unsigned char Codeword = Read(DO.DataBuffer, BitOffset); // Read also updates BitOffset

		// Provide the updated BitOffset to the next thread
		Offsets[NextTID] = BitOffset;

		__syncthreads(); // barrier outside the spin loop

		// Another long decoding part
	}
}



I'm testing with 2 threads.

The first thread should immediately skip the spin loop, then wait at the second __syncthreads() after it has stored the offset into shared memory for the second thread.

The second thread enters the spin loop and stops at the __syncthreads() inside it.

After both threads have hit a __syncthreads(), one inside the spin loop and the other outside, the second thread exits the spin loop, calculates the next offset, stores it for the first thread, and stops at the __syncthreads() outside the spin loop. At the same time the first thread completes one decoding cycle and stops at the __syncthreads() inside the spin loop. This is not exactly what I want to achieve (it basically serializes the pipeline again), but even this approach is not working. In reality each thread completes just one decoding cycle; after that they don't see each other's offset results and get stuck "forever" inside the spin loop, until terminated by 100000 < ErrorCounter.

After the kernel gives up and exits, the offsets visible to each thread are:

Thread 0 sees:

Offsets[0] = 0

Offsets[1] = 16

Thread 1 sees:

Offsets[0] = 32

Offsets[1] = 16

I don't understand why thread 0 doesn't see Offsets[0] = 32.

I'm still wondering whether it's a caching or a scheduling issue.

__syncthreads() must be encountered by all threads in a block, otherwise behavior is undefined.

And don't try spinlocks at all. Threads on the GPU do not execute independently, so if one thread spins on a lock, you cannot expect the others to make progress and release the lock.

You should structure your code somewhat like this:

__global__ void kernel() {

    for (int i = 0; i < blockDim.x; i++) {

        if (i == threadIdx.x) {
            // do (sequential) calculation for thread i...
        }

        __syncthreads(); // reached by all threads of the block, so well-defined

    }

    // parallel part...
}
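Applied to your decoder, the offset handoff would go inside the serialized section, roughly like this (a sketch reusing the names from your kernel):

```cuda
while (true) {
    // long decoding part, runs in parallel...

    for (int i = 0; i < blockDim.x; i++) {
        if (i == threadIdx.x) {
            BitOffset = Offsets[TID];                                // consume our offset
            unsigned char Codeword = Read(DO.DataBuffer, BitOffset); // advances BitOffset
            Offsets[NextTID] = BitOffset;                            // publish to the next thread
        }
        __syncthreads(); // executed by every thread: no divergence, no undefined behavior
    }

    // another long decoding part, runs in parallel...
}
```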


Thank you for that hint.
I will try to de-skew my pipeline, serialize the offset-processing part of the code, and post back if I'm successful.