WARP, Shared Memory, Synchronization: synchronization within warp threads

Hi,

The 1.0 manual says that threads inside a warp cannot assume strict ordering among themselves with respect to shared memory, and that one has to use __syncthreads() for this. But __syncthreads() synchronizes with all other threads in the block, which is a very costly operation.

When I experimented with the generated PTX assembly code, I realized that the compiler optimizes loads from shared memory by keeping the values in registers. So I guessed that using the “volatile” keyword would solve the problem, and it did. Later, I also found the same point mentioned in the 1.1 manual, which more or less confirms what I have seen.

I just want a small confirmation from NVIDIA that, apart from the “volatile” keyword, there is no other restriction (such as a hardware limitation) that prevents strict ordering among threads in a warp.

That is:

If I use volatile variables, then thread I should be able to see the latest volatile shared memory data written by thread J, given that I and J belong to the same warp.
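
A minimal sketch of the kind of pattern this question is about (not code from the original experiment): a sum over one warp through a volatile shared array, with no __syncthreads(). The kernel name warpSum, the array buf and the WARP_SYNC macro are hypothetical names chosen for illustration. The macro is an added assumption: it expands to nothing on older toolkits, where the sketch relies on the warp executing in lock-step, and to __syncwarp() on CUDA 9 and later, where lock-step execution within a warp is no longer guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Assumption: older toolkits have no __syncwarp(); there the sketch relies on
// lock-step execution within a warp plus "volatile".
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
#define WARP_SYNC() __syncwarp()
#else
#define WARP_SYNC()
#endif

// Sums WARP_SIZE integers using a single warp and no __syncthreads().
__global__ void warpSum(const int *in, int *out)
{
    // "volatile" forces every access to go to shared memory instead of
    // letting the compiler keep a stale copy in a register.
    __shared__ volatile int buf[WARP_SIZE];

    const unsigned int lane = threadIdx.x;  // launched with exactly one warp
    buf[lane] = in[lane];
    WARP_SYNC();

    // Tree reduction: at each step, thread "lane" reads a value that
    // thread "lane + stride" wrote in the previous step.
    for (unsigned int stride = WARP_SIZE / 2; stride > 0; stride >>= 1) {
        if (lane < stride)
            buf[lane] = buf[lane] + buf[lane + stride];
        WARP_SYNC();
    }

    if (lane == 0)
        *out = buf[0];
}

int main()
{
    int h_in[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i)
        h_in[i] = i + 1;  // 1..32, so the sum should be 528

    int *d_in = NULL;
    int *d_out = NULL;
    int h_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpSum<<<1, WARP_SIZE>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("warp sum = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without “volatile” on buf, the compiler is free to hold buf[lane] in a register across loop iterations, which is the behaviour seen in the PTX.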

Thank you

I would greatly appreciate a reply to this question. Thanks a lot.
The algorithm that I am going to choose depends on this, and it can matter a lot.
Thank you.