Problem with 32-bit atomics on 195?

I can’t give a very precise description yet, but here are the main facts:
Code that runs fine on 189-191 has the following problems on 195:

  1. __constant is no longer accepted as a kernel argument modifier (might be intentional, of course);

  2. the kernel now freezes the machine after compilation with the argument modifiers omitted (see 1).
    The code is of much the same type as nbody, but implements a scheme for symmetric forces, i.e. every combination is computed only once. The scheme requires a bit of workgroup coordination: a global var (acceleration) is locked while results are added.
    This is accomplished by atom_or() and atom_xor() at the end of the kernel.

     const unsigned block=OPPBLOCK(index);  // OPPBLOCK is a macro
     const unsigned ba=block>>LOCKADDRSHIFT;  // allow for several locking values per 32 bit. LOCKADDRSHIFT is typically 2; larger values cause problems.
     const unsigned bitvalue=1<<(block-(ba<<LOCKADDRSHIFT));
     while ((atom_or(&acc_locked[ba],bitvalue)&bitvalue)!=0); // try to lock and proceed when not previously locked
     accs[block].x+=tmp4.x; accs[block].y+=tmp4.y; accs[block].z+=tmp4.z;
     atom_xor(&acc_locked[ba],bitvalue);  // unlock

Locking proved necessary; omitting it gives somewhat inaccurate results. My guess is that this part of the code causes the freeze on 195.39, but I have no proof yet, because I do not have access to 195.39 all the time.
So, maybe I’m doing something weird, but then, maybe it is worthwhile to check this one out as a potential bug.
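
For anyone who wants to check the lock addressing away from the GPU, here is a rough host-side sketch in plain C11. It is not the kernel code: atomic_fetch_or() and atomic_fetch_xor() stand in for atom_or() and atom_xor(), the helper names lock_word, lock_bit, try_lock and unlock are made up for illustration, and LOCKADDRSHIFT is assumed to be 2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCKADDRSHIFT 2u /* assumed value; the post says it is typically 2 */

/* 1 << LOCKADDRSHIFT lock bits are used per word, so they must fit in 32 bits. */
static_assert(LOCKADDRSHIFT <= 5u, "lock bits must fit in one 32-bit word");

/* Index of the 32-bit word holding the lock bit for this block ("ba" in the kernel). */
static unsigned lock_word(unsigned block)
{
    return block >> LOCKADDRSHIFT;
}

/* Mask of the lock bit inside that word ("bitvalue" in the kernel). */
static uint32_t lock_bit(unsigned block)
{
    return 1u << (block - (lock_word(block) << LOCKADDRSHIFT));
}

/* One lock attempt: set the bit and report whether it was clear before. */
static bool try_lock(_Atomic uint32_t *locks, unsigned block)
{
    uint32_t bit = lock_bit(block);
    return (atomic_fetch_or(&locks[lock_word(block)], bit) & bit) == 0;
}

/* Release: toggle the bit back to 0. Only valid while the lock is held. */
static void unlock(_Atomic uint32_t *locks, unsigned block)
{
    atomic_fetch_xor(&locks[lock_word(block)], lock_bit(block));
}
```

With these assumptions, block 6 maps to word 1 and bit mask 4. The kernel's while loop corresponds to calling try_lock() repeatedly until it returns true.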

The full code can be downloaded from:
good luck,

Quote left out, update:
I changed the coordination of writing to a global array by different threadblocks.
I don’t use semaphores any longer, but use atom_cmpxchg() on floats as if they were ints.
This takes a lot of casting, but hey, it works, both on 195 and on pre 195.
So, what I do is use atom_cmpxchg() to check that the base value used to calculate the new value is unmodified at the time of replacement; if not, the process is repeated with the changed base value, until there is a match. So far I have not had freeze-outs resulting in a “CL_OUT_OF_RESOURCES” error, which I had frequently using the semaphore method.
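
To make the retry loop concrete, here is a minimal host-side sketch in plain C11 rather than the actual kernel. atomic_compare_exchange_weak() plays the role of atom_cmpxchg(), the float is stored as its 32-bit pattern, and the helper names float_to_bits, bits_to_float and atomic_add_float are invented for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Reinterpret the bits of a float as an integer, and back, via memcpy. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hypothetical helper: atomically add delta to the float stored in *cell. */
static void atomic_add_float(_Atomic uint32_t *cell, float delta)
{
    uint32_t old_bits = atomic_load(cell); /* base value for the sum */
    uint32_t new_bits;
    do {
        new_bits = float_to_bits(bits_to_float(old_bits) + delta);
        /* If *cell no longer equals old_bits, the exchange fails, old_bits is
           refreshed with the current contents, and the sum is recomputed. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
}
```

Comparing the integer bit patterns rather than the float values means the loop also terminates when the stored value is a NaN, which would never compare equal to itself as a float.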

The compiler on 195.81 crashes if you omit an explicit cast of the first arg of atom_cmpxchg() to __global int *.
Further, I had some trouble with pointers: afaik the compiler generates the same pointer for the .x, .y and .z components of a float4.