The 1.0 manual says that threads inside a warp cannot assume strict ordering among themselves with respect to shared memory, and that one has to use __syncthreads() for this. But __syncthreads() synchronizes with all the other threads in the BLOCK, which is a very costly operation.
When I experimented with the generated PTX assembly, I realized that the compiler optimizes LOADs from shared memory by keeping the values in registers. So I guessed that using the “volatile” keyword would solve the problem, and it did. Later, I found the same point mentioned in the 1.1 manual, which more or less confirms what I have seen.
I just want a small confirmation from NVIDIA that, apart from the “volatile” keyword, there is NO other restriction (such as a hardware limitation) that prevents strict ordering among threads in a warp.
If I use volatile variables, then thread I should be able to see the latest volatile shared memory data written by thread J, given that I and J belong to the same warp.
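To make the question concrete, here is a minimal sketch of the pattern I have in mind: a single warp sums 32 values through shared memory without calling __syncthreads(). The names warpSum, runWarpSum and WARP_BARRIER are made up for illustration, and error checking is left out. The volatile pointer is what stops the compiler from keeping the shared memory values in registers. The guarded __syncwarp() call is an addition for toolkits new enough to provide it (CUDA 9 and later), since lock-step execution within a warp is exactly the assumption I am asking about; on older toolkits the macro expands to nothing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Hypothetical helper macro: a warp-level barrier where the toolkit has one,
// and nothing at all on older toolkits (the case this question is about).
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
#define WARP_BARRIER() __syncwarp()
#else
#define WARP_BARRIER()
#endif

// Illustrative kernel: one warp sums WARP_SIZE ints through shared memory.
__global__ void warpSum(const int *in, int *out)
{
    __shared__ int buf[WARP_SIZE];

    // Accessing the buffer through a volatile pointer forces every read and
    // write to be a real shared memory access instead of a cached register.
    volatile int *vbuf = buf;

    const unsigned lane = threadIdx.x;
    vbuf[lane] = in[lane];
    WARP_BARRIER();

    // Tree reduction: at each step, thread I reads what thread J = I + offset
    // wrote in the previous step. No __syncthreads() is used.
    for (unsigned offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        if (lane < offset)
            vbuf[lane] = vbuf[lane] + vbuf[lane + offset];
        WARP_BARRIER();  // reached by all 32 threads, outside the branch
    }

    if (lane == 0)
        *out = vbuf[0];
}

// Host-side driver (error checking omitted for brevity).
int runWarpSum()
{
    int h_in[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i)
        h_in[i] = i;

    int *d_in = NULL;
    int *d_out = NULL;
    int h_out = -1;

    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpSum<<<1, WARP_SIZE>>>(d_in, d_out);  // exactly one warp

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    const int expected = (WARP_SIZE - 1) * WARP_SIZE / 2;  // 0 + 1 + ... + 31
    const int sum = runWarpSum();
    printf("sum = %d (expected %d)\n", sum, expected);
    return sum == expected ? 0 : 1;
}
```

If the volatile qualifier is dropped and there is no barrier, the compiler is free to keep the partial sums in registers, which is the behaviour I saw in the PTX.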