Compute Shader Problem

I’m implementing a simple N-body simulation using DX11 compute shaders, running on a GTX 280 with driver version 306.97. The theory is based on this article: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

I also noticed that such a simulation is already part of the MS DX SDK (nBodyGravityCS11), from which I took some inspiration.

The problem I encountered:

void body_body_interaction(inout float3 ai, float4 bi, float4 bj)
{
	// Vector from body i to body j
	float3 r = bj.xyz - bi.xyz;

	// Squared distance plus softening term
	float distSqr = dot(r, r);
	distSqr += g_softeningFactorSq;

	// 1 / distance^3
	float distInvCube = 1.0f / sqrt(distSqr * distSqr * distSqr);

	//ai += g_FG * bj.w * distInvCube * r; // NOT WORKING
	ai += g_FG * g_fParticleMass * distInvCube * r; // WORKS; g_fParticleMass can be either in a cbuffer or a global constant, both work
}

The variable bj (xyz = position, w = mass) is first loaded into shared memory, then GroupMemoryBarrierWithGroupSync() is called to synchronize the group.

[loop]
for (uint block = 0; block < num_blocks; ++block)
{
	//Fetch positions to shared cache
	sh_Positions[indexGroup] = oldPar[block * BLOCK_SIZE + indexGroup].pos;
	GroupMemoryBarrierWithGroupSync();

	[unroll]
	for (uint i = 0; i < BLOCK_SIZE; i += 8)
	{
		body_body_interaction(accel, myParticle.pos, sh_Positions[i]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+1]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+2]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+3]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+4]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+5]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+6]);
		body_body_interaction(accel, myParticle.pos, sh_Positions[i+7]);
	}

	GroupMemoryBarrierWithGroupSync();
}
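The tiling scheme in the loop above can be cross-checked on the CPU. The following is only a minimal C++ sketch, not the shader itself: Float3, Float4, kFG, kSofteningSq, kBlockSize, accelDirect and accelTiled are hypothetical names and values chosen for illustration, and the particle count is assumed to be a multiple of the block size. The local tile array stands in for sh_Positions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };  // xyz = position, w = mass

// Hypothetical constants; the real values come from the shader's constants.
const float kFG = 1.0f;
const float kSofteningSq = 0.01f;
const std::size_t kBlockSize = 4;

// Same arithmetic as body_body_interaction, reading the mass from bj.w.
void bodyBodyInteraction(Float3& ai, const Float4& bi, const Float4& bj) {
    const float rx = bj.x - bi.x;
    const float ry = bj.y - bi.y;
    const float rz = bj.z - bi.z;
    const float distSqr = rx * rx + ry * ry + rz * rz + kSofteningSq;
    const float distInvCube = 1.0f / std::sqrt(distSqr * distSqr * distSqr);
    const float s = kFG * bj.w * distInvCube;
    ai.x += s * rx;
    ai.y += s * ry;
    ai.z += s * rz;
}

// Reference: accumulate over all bodies directly, without tiling.
Float3 accelDirect(const std::vector<Float4>& bodies, std::size_t i) {
    Float3 accel{0.0f, 0.0f, 0.0f};
    for (std::size_t j = 0; j < bodies.size(); ++j)
        bodyBodyInteraction(accel, bodies[i], bodies[j]);
    return accel;
}

// Tiled: copy one block into a local array, then accumulate from that copy.
Float3 accelTiled(const std::vector<Float4>& bodies, std::size_t i) {
    Float3 accel{0.0f, 0.0f, 0.0f};
    Float4 tile[kBlockSize];
    const std::size_t numBlocks = bodies.size() / kBlockSize;
    for (std::size_t block = 0; block < numBlocks; ++block) {
        for (std::size_t t = 0; t < kBlockSize; ++t)
            tile[t] = bodies[block * kBlockSize + t];  // "fetch to shared cache"
        for (std::size_t t = 0; t < kBlockSize; ++t)
            bodyBodyInteraction(accel, bodies[i], tile[t]);
    }
    return accel;
}
```

Both functions should agree up to floating-point rounding, so a GPU result that differs wildly from accelTiled would point at the shared-memory path rather than at the math.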

If I use the mass stored in bj.w, I end up with NaNs (examined using Nsight). Any other way works, whether a global constant or a cbuffer.

The funny thing is that if I do the same thing in the MS demo, the result is exactly the same: I get no output and the buffer contains NaNs. Why am I unable to use the 4th vector component from shared memory in this case? It is initialized properly on the CPU side and then copied to the GPU.
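Since the NaNs only show up when the mass is read from the w component, one sanity check is to scan the particle data on the CPU for non-finite values, both before upload and after reading a buffer back. A minimal C++ sketch, assuming a hypothetical Float4 layout that matches pos (xyz = position, w = mass); firstNonFinite is an illustrative helper, not part of the SDK sample:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };  // xyz = position, w = mass (assumed layout)

// Returns the index of the first element that has a NaN or infinite
// component, or -1 if every component of every element is finite.
std::ptrdiff_t firstNonFinite(const std::vector<Float4>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        const Float4& p = data[i];
        if (!std::isfinite(p.x) || !std::isfinite(p.y) ||
            !std::isfinite(p.z) || !std::isfinite(p.w)) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

If this returns -1 for the CPU-side array, the input is clean, which would suggest the NaNs are introduced on the GPU side.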

Full shader code here: http://pastebin.com/SJhs8ntt

Thank you very much.

I tried re-implementing the above algorithm using global memory only, and it works, so I guess there is some issue with shared memory. I also tried changing the group size and reducing the particle count, but the result is still the same.

Moreover, how is synchronization with a CS handled on the CPU side? I don’t need the computed data on the CPU; I re-use it on the GPU to position the computed particles using instancing. Which call actually forces the pipeline to wait until the CS has finished? Unbinding the UAV? Binding the SRV (same buffer as the UAV)?

Another update: it must be a driver bug or something similar, because the shared memory implementation works correctly on a GTX 680 with the same driver version.