Concurrent shared memory read/write access

Hello everyone,

I'm writing a decoder that reads variable-length codes.

I have 8 threads in a block.

Each thread does some decoding work, then enters a read section where it needs its offset into the data array.

It reads the offset, does some data reading, and once the variable length of the codeword is known, it increments the offset by this length

and passes it to the next thread.

So basically thread N must wait for the offset from thread N-1. This is done with a waiting spin loop.

While this parallelization works flawlessly in the CPU implementation, I run into a shared memory problem in CUDA.

The offsets are organized in an array, one element per thread. The structure of the offset array and the way threads pass offsets ensure that there is NO race condition, so synchronization should not be required.

In the CUDA implementation I'm using shared memory to store the offset array:

extern __shared__ int BlockArray[];

volatile unsigned int * Offsets = (unsigned int *)BlockArray;

Then, while the offset for thread x is not ready, the thread waits in a spin loop:

while (Offsets[threadIdx.x] < BitOffset) {};

When a thread gets a valid offset, it leaves the spin loop, does some data reading, calculates the next offset, and stores it for the next thread:

BitOffset += CodewordLength;

Offsets[(threadIdx.x + 1) % blockDim.x] = BitOffset;
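Put together, the handoff in each thread looks roughly like this (a sketch; Read() is my bit reader and also advances BitOffset by the codeword length):

```cuda
// Consumer: spin until our slot holds a fresh offset, then take it.
while (Offsets[threadIdx.x] < BitOffset) {}
BitOffset = Offsets[threadIdx.x];

// Read the variable-length codeword; Read() advances BitOffset
// by the codeword length once it is known.
unsigned char Codeword = Read(DataBuffer, BitOffset);

// Producer: publish the advanced offset to the next thread's slot.
Offsets[(threadIdx.x + 1) % blockDim.x] = BitOffset;
```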

As I said, this works flawlessly on the CPU; on the GPU, however, I see some strange behavior:


Thread 0 reads offset 0 and stores the next offset, 13, for thread 1.

Thread 1 reads offset 13 and stores the next offset, 27, for thread 0.

But in its second iteration, thread 0 does not see that its offset in shared memory is now 27. It still sees the initial offset 0!


I have followed the CUDA Programming Guide, declared Offsets as volatile, and even used atomicCAS and atomicExch, but no luck.

It looks to me as if some caching is going on in the background. Once a thread has read a shared memory value, all subsequent reads from that location give the thread the old value. It never detects that this shared memory location has changed.

Isn't shared memory supposed to be uncached?
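To illustrate what I suspect (just my mental model, not actual compiler output): if Offsets were not declared volatile, the compiler could legally keep the loaded value in a register, and the spin loop would never re-read shared memory:

```cuda
// Hypothetical code the compiler could generate WITHOUT volatile:
unsigned int cached = Offsets[threadIdx.x]; // single load into a register
while (cached < BitOffset) {}               // spins forever, never re-reads
```

But since Offsets is declared volatile, every reference should compile into an actual shared memory load, so that alone can't be the whole story.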

Currently it behaves more like a stale snapshot of shared memory.

Also, I can't use __syncthreads(), as my kernel threads form a pipeline where each thread is expected to be in a different stage of processing. Using __syncthreads() (if even possible) would lead to stalling and kill the pipeline, as there would be about 3 million synchronizations per kernel call.


I know, 8 threads is a very low degree of parallelism, but there will be another level of parallelism at the grid level, without inter-block dependencies.


If thread N waits for the results of thread N-1 before doing its own computation, this is a sequential algorithm, not a parallel one.

Use __syncthreads() to synchronize the threads of a block. Do not try to invent spinlocks; quite likely they don't do what you think they do.

The portion of the code that fetches an offset and calculates the new one takes 5% of the time; the remaining 95% can safely run in parallel.
This is a pipeline. Only the offset calculations can't overlap. In fact, if those calculations don't come close to each other, you get nearly full speedup by the number of cores.
As I said, the CPU implementation already does this, and there are about 3 million "offset synchronizations" per function call.
I'm not trying to invent anything; spin loops turned out to be the best solution, since they don't stall the entire CPU pipeline or my decoding pipeline; only the waiting thread is blocked (as it should be). Spin loops also prevent a thread from being suspended, and suspending a single thread would stall the entire pipeline for around 10,000 CPU cycles.

I cannot get rid of the offset dependency, because the data coming from the hardware encoder is aligned that way.

I thought it was a shared memory problem, but I spoke to one of our CUDA guys at the university. He says that thread 1 is probably completely suspended by the scheduler while thread 0 is spinning, waiting for the offset from the suspended thread 1. This lasts for nearly 2 seconds, until the GPU driver is reset by the OS.
Maybe it really is a scheduler problem; I'm not sure now…

I've already tried volatile, shared and global memory, __threadfence(), __threadfence_block(), and __syncthreads(), with and without atomic reads/writes; the result is always the same: thread A doesn't see the changes made to shared memory by thread B.

Spinlocks on the CPU are fine, of course. What I wanted to say is: spinlocks on the GPU, implemented straightforwardly like on the CPU, just won't work!
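The classic failure mode looks something like this (illustrative sketch; "flag" is a made-up variable). Within a warp, the two sides of a divergent branch are serialized, and the hardware is free to keep executing the spinning side and never switch to the other:

```cuda
__shared__ volatile int flag; // assume initialized to 0 beforehand

if (threadIdx.x == 0) {
    while (flag == 0) {}  // thread 0 spins on the flag...
} else if (threadIdx.x == 1) {
    flag = 1;             // ...but thread 1 sits in the other branch of the
                          // same warp and may never get to execute this line,
                          // so the whole warp hangs
}
```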

Can you show us the relevant part of your __syncthreads()-based implementation? That’s the one that is supposed to work.

__global__ void DecoderKernel(DecodingObject DO)
{
	// Initializations...
	extern __shared__ int BlockArray[];

	int TID = threadIdx.x;
	int NextTID = (TID + 1) % DO.Threads;
	unsigned int BitOffset = TID == 0 ? 0 : 1;
	volatile unsigned int * Offsets = (unsigned int *)BlockArray;

	for (unsigned int i = 0; i < DO.Threads; i++)
		Offsets[i] = 0;

	while (true)
	{
		// Long decoding part...

		int ErrorCounter = 0;
		while (Offsets[TID] < BitOffset) // spin loop
		{
			__syncthreads(); // barrier inside the spin loop

			if (100000 < ++ErrorCounter)
			{ // give up and return the offsets visible from this thread
				*((unsigned int *)&DO.Output[8*TID])     = Offsets[0];
				*((unsigned int *)&DO.Output[8*TID + 4]) = Offsets[1];
				return;
			}
		}

		// Now the next offset is known, use it for reading DataBuffer
		BitOffset = Offsets[TID];

		// Reading variable-bit-length data
		unsigned char Codeword = Read(DO.DataBuffer, BitOffset); // Read also updates BitOffset

		// Provide the updated BitOffset to the next thread
		Offsets[NextTID] = BitOffset;

		__syncthreads(); // barrier outside the spin loop

		// Another long decoding part
	}
}



I'm testing with 2 threads.

The first thread should immediately skip the spin loop, then wait at the second __syncthreads() after it has stored the offset into shared memory for the second thread.

The second thread enters the spin loop and stops at the __syncthreads() inside it.

After both threads have hit a __syncthreads(), one inside the spin loop and the other outside, the second thread exits the spin loop, calculates the next offset, stores it for the first thread, and stops at the __syncthreads() outside the spin loop. At the same time the first thread completes one decoding cycle and stops at the __syncthreads() inside the spin loop. This is not exactly what I want to achieve (it basically serializes the pipeline again), but even this approach is not working. In reality each thread completes just one decoding cycle; after that they don't see each other's offset results and get stuck "forever" inside the spin loop, until terminated by 100000 < ErrorCounter.

After the kernel gives up and exits, the offsets visible to each thread are:

Thread 0 sees:

Offsets[0] = 0

Offsets[1] = 16

Thread 1 sees:

Offsets[0] = 32

Offsets[1] = 16

I don't understand why thread 0 doesn't see Offsets[0] = 32.

I'm still wondering whether it's a caching or a scheduling issue.

__syncthreads() must be encountered by all threads in a block, otherwise behavior is undefined.

And don't try spinlocks at all. Threads on the GPU do not execute independently, so if one thread spins on a lock, you cannot expect the others to make progress and release the lock.

You should structure your code somewhat like this:

__global__ void kernel() {

    for (int i = 0; i < blockDim.x; i++) {

        if (i == threadIdx.x) {
            // do (sequential) calculation for thread i...
        }

        __syncthreads(); // reached by all threads of the block, so well-defined

    }

    // parallel part...
}
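Applied to your decoder, the offset handoff would go inside the serialized section, roughly like this (a sketch reusing the names from your kernel):

```cuda
while (true) {
    // long decoding part, runs in parallel...

    for (int i = 0; i < blockDim.x; i++) {
        if (i == threadIdx.x) {
            BitOffset = Offsets[TID];                                // consume our offset
            unsigned char Codeword = Read(DO.DataBuffer, BitOffset); // advances BitOffset
            Offsets[NextTID] = BitOffset;                            // publish to the next thread
        }
        __syncthreads(); // executed by every thread: no divergence, no undefined behavior
    }

    // another long decoding part, runs in parallel...
}
```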


Thank you for that hint.
I will try to de-skew my pipeline, serialize the offset-processing part of the code, and post back if I'm successful.