Race condition


I’m having an issue with my code: a small race condition.

Here is the idea of my algorithm:

Several threads write data to global memory, and each one atomically adds 1 to a global counter after storing.

When the last thread writes its data, it knows it is the last one because the global counter has reached the expected value. That last thread then performs a computation on the data provided by the other threads.

Problem: very rarely, a value that should already be set is not, and the computation is wrong.

I know the race condition happens when the thread doing the final computation reads data from threads outside its block, so I think I have a data cache coherency problem at the L2 level.

I wrote this store helper in PTX:

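__device__ void st_gbl_cg_f4(volatile float4 *addr, volatile float4 v)
{
    // Store a float4 to global memory with the .cg operator (cache in L2, bypass L1).
    asm volatile(
        "{\n\t"
        ".reg .f32 e, f, g, h;\n\t"
        "mov.f32 e, %0;\n\t"
        "mov.f32 f, %1;\n\t"
        "mov.f32 g, %2;\n\t"
        "mov.f32 h, %3;\n\t"
        "st.global.cg.v4.f32 [%4], {e, f, g, h};\n\t"
        "}"
        : : "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w), "l"(addr) : "memory");
}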


The idea is to bypass the L1 cache and avoid incoherencies, but it doesn’t work.

I currently use this instead:

__device__ void st_atomic_f4(float4 *addr, float4 v)
{
    // Write the float4 as two 64-bit atomic exchanges (x,y first, then z,w).
    float2 v2 = make_float2(v.x, v.y);
    atomicExch((unsigned long long int*)addr, *(unsigned long long int*)(&v2));
    v2 = make_float2(v.z, v.w);
    atomicExch(((unsigned long long int*)addr) + 1, *(unsigned long long int*)(&v2));
}


And it works.
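A minimal sketch of the same workaround written without pointer casts on the value, assuming the goal is only to push each 64-bit half through `atomicExch`. The names `pack2`, `st_atomic_f4_packed` and `demo` are hypothetical and not part of the original code; `__float_as_uint` reinterprets the float bits, and the low word is placed first because NVIDIA GPUs are little-endian.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: pack two floats into one 64-bit word.
// 'lo' lands at the lower address (little-endian layout).
__device__ unsigned long long pack2(float lo, float hi)
{
    return (unsigned long long)__float_as_uint(lo) |
           ((unsigned long long)__float_as_uint(hi) << 32);
}

// Hypothetical variant of st_atomic_f4: two 64-bit atomic exchanges.
__device__ void st_atomic_f4_packed(float4 *addr, float4 v)
{
    unsigned long long *p = reinterpret_cast<unsigned long long *>(addr);
    atomicExch(p,     pack2(v.x, v.y));   // writes x and y
    atomicExch(p + 1, pack2(v.z, v.w));   // writes z and w
}

__global__ void demo(float4 *out)
{
    st_atomic_f4_packed(out, make_float4(1.0f, 2.0f, 3.0f, 4.0f));
}

int main()
{
    float4 *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float4));
    demo<<<1, 1>>>(d_out);

    float4 h;
    cudaMemcpy(&h, d_out, sizeof(float4), cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", h.x, h.y, h.z, h.w);
    cudaFree(d_out);

    // Exit code 0 only if all four components round-tripped.
    return (h.x == 1.0f && h.y == 2.0f && h.z == 3.0f && h.w == 4.0f) ? 0 : 1;
}
```

Note that the two exchanges are atomic individually, not as a pair, so a reader could in principle still observe one half updated and the other not.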

Two questions:

- Is it possible for writes to complete out of order? In my case it seems the global atomicAdd finishes before my global float4 write, even though the atomicAdd is issued after the float4 write.

- Is something wrong with my PTX code? I’m new to it.
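To make the ordering question concrete, here is a minimal sketch of the store-then-count pattern with an explicit `__threadfence()` between the data store and the `atomicAdd`, which is the arrangement the CUDA programming guide uses in its memory fence example. This is not the original kernel: `done_count`, `write_then_reduce`, `slots`, `result` and the launch sizes are all hypothetical, and whether the fence resolves the problem described here is an assumption, not something confirmed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical global arrival counter.
__device__ unsigned int done_count = 0;

__global__ void write_then_reduce(float4 *slots, float *result, unsigned int total)
{
    unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= total) return;

    // 1. Each thread publishes its data to global memory.
    float v = (float)id;
    slots[id] = make_float4(v, v, v, v);

    // 2. Fence: order the store above before the counter update below,
    //    as seen by every other thread on the device.
    __threadfence();

    // 3. Announce arrival; atomicAdd returns the previous counter value.
    unsigned int prev = atomicAdd(&done_count, 1u);

    // 4. The thread that observes total - 1 is the last to arrive.
    if (prev == total - 1) {
        // Fence on the consumer side too, so the reads below are ordered
        // after the counter value was observed.
        __threadfence();

        // Volatile reads so the compiler does not reuse stale values.
        const volatile float *in = (const volatile float *)slots;
        float sum = 0.0f;
        for (unsigned int i = 0; i < 4 * total; ++i) sum += in[i];
        *result = sum;
        done_count = 0;   // reset for a possible later launch
    }
}

int main()
{
    const unsigned int blocks = 8, threads = 32, total = blocks * threads;
    float4 *d_slots = nullptr;
    float *d_result = nullptr;
    cudaMalloc(&d_slots, total * sizeof(float4));
    cudaMalloc(&d_result, sizeof(float));

    write_then_reduce<<<blocks, threads>>>(d_slots, d_result, total);

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    // Each thread contributes 4 * id, so the sum is 2 * total * (total - 1).
    float expected = 2.0f * (float)total * (float)(total - 1);
    printf("sum = %g, expected = %g\n", h_result, expected);

    cudaFree(d_slots);
    cudaFree(d_result);
    return (h_result == expected) ? 0 : 1;
}
```

Without step 2, nothing in this sketch would order the `float4` store relative to the counter increment from the point of view of another block, which is the scenario the first question asks about.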

I’m happy to clarify anything that isn’t understandable.

Thank you.