The reduction examples in the 2.3 SDK did not use volatile shared memory. The same example code in the 3.1 SDK does use volatile shared memory, and apparently really needs it, or bad things happen (I know because I have old code based on the old reduction sample that breaks when built with the new 3.1 compiler).
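For reference, this is roughly the pattern I mean; it is my own sketch of the warp-synchronous tail used in the 3.1-style reduction, not the SDK source, and it assumes blockDim.x is a power of two of at least 64 with blockDim.x * sizeof(float) bytes of dynamic shared memory passed at launch:

__global__ void reduce(const float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // One element per thread into shared memory.
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Warp-synchronous part: no __syncthreads() here, so the pointer is
    // declared volatile to force every read and write to actually go to
    // shared memory instead of being cached in registers or reordered.
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid +  8];
        smem[tid] += smem[tid +  4];
        smem[tid] += smem[tid +  2];
        smem[tid] += smem[tid +  1];
    }

    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}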
Can someone explain what has changed? I have seen no mention of anything like this in release notes for 3.0 or for 3.1.
It is mentioned in the Release Notes for 3.1.
I cannot give you any background, but I speculate that the compiler has become more aggressive about reordering shared memory accesses. Obviously, using volatile would have been semantically correct from the beginning.
I wonder when Nvidia will introduce a __threadfence_warp(), which I think would be the proper way to handle this.
EDIT: In this presentation, Gernot Ziegler of NVIDIA gives an explanation: “Reason: C1060 (GT200) could access shmem directly as operand, while C2050 (Fermi) uses load/store architecture into registers!”. To me this looks very much like you could have run into the problem on GT200 too, if an instruction happened to use two shared memory operands, but usually you were lucky and it just worked.
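To illustrate that explanation, here is a toy kernel (my own example, not from the SDK) showing the pattern that can break on Fermi when the volatile qualifier is left out; launch it with a single block of 64 threads:

__global__ void broken_warp_reduce(float *g_out)
{
    __shared__ float sdata[64];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (float)tid;
    __syncthreads();

    // Without volatile, a load/store architecture like Fermi may keep
    // sdata[tid] in a register across these statements and defer the
    // stores, so a thread need not see its neighbours' partial sums.
    // On GT200 the shared-memory location was typically used directly
    // as an instruction operand, which is why this tended to work there.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        *g_out = sdata[0];   // may be wrong when sdata is not volatile
}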