DirectCompute rewriting to TGSM

Hi,
I have a DirectCompute shader using TGSM (thread group shared memory) that does something like the following:

groupshared float4 sm[64];

void main(uint thread_idx : SV_GroupIndex)
{
    // v and v2 stand for values computed earlier in the shader
    sm[thread_idx] = v;
    GroupMemoryBarrierWithGroupSync();
    // use sm values, everything is good here
    sm[thread_idx] = v2;
    GroupMemoryBarrierWithGroupSync();
    // use sm values again, values are corrupted now
}

So the second time I write to and read from the same TGSM array, the values I read back are wrong. If I use a second TGSM array for the second pass, everything works fine. I’ve read the advice to write to and read from TGSM only once (http://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf), but nothing there says that reusing it won’t work. Any thoughts? I am bottlenecked by register usage rather than TGSM storage, so this is not a big problem (for now), but I would still like to know why I cannot reuse a TGSM array in a shader. This is cs_5_0 on a GTX 980 Ti with the 361.43 drivers.
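
One thing I have not ruled out is a write-after-read hazard: nothing in the pattern above stops a fast thread from storing its second value while a slower thread is still reading the first-pass values. Below is a minimal sketch of the same pattern with an extra barrier between the first read and the second write. The names Input, Output and GROUP_SIZE, the neighbour read, and the arithmetic are placeholders for illustration and are not taken from my actual shader.

```hlsl
// Hypothetical resources and group size, for illustration only.
#define GROUP_SIZE 64

StructuredBuffer<float4>   Input  : register(t0);
RWStructuredBuffer<float4> Output : register(u0);

groupshared float4 sm[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void main(uint thread_idx : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    float4 v = Input[dtid.x];

    // Pass 1: every thread publishes its value, then the group syncs.
    sm[thread_idx] = v;
    GroupMemoryBarrierWithGroupSync();

    // Read a slot written by another thread.
    float4 a = sm[(thread_idx + 1) % GROUP_SIZE];

    // Extra barrier: wait until every thread has finished reading the
    // pass-1 values before any thread overwrites its slot.
    GroupMemoryBarrierWithGroupSync();

    // Pass 2: reuse the same TGSM array.
    float4 v2 = a * 2.0f;
    sm[thread_idx] = v2;
    GroupMemoryBarrierWithGroupSync();

    float4 b = sm[(thread_idx + 1) % GROUP_SIZE];

    Output[dtid.x] = a + b;
}
```

If the corruption goes away with the middle barrier in place, the cause would be a race in the shader rather than a driver bug or a spec limitation. It would also explain why a second TGSM array works, since the second-pass writes can then never land on data that another thread is still reading.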

Thanks

Ping. Any thoughts? I’m wondering whether this is a driver bug or a spec limitation.