Atomic operation and variable access

Is it fair to say that if I have a kernel in which I perform an atomic operation (say an atomicAdd) on foo in global memory, and later in the same kernel I read the value of foo, then I’m looking for trouble? It can be the other way around too: reading foo and then atomically incrementing it.

I can imagine a situation where a thread from one block reads foo through a pointer while a thread from a different block, at exactly the same time, is atomically operating on foo. This might happen only rarely, but when it does, I imagine the behavior is undefined.

Am I getting this right?

The behavior isn’t purely UB, in my opinion. However, there is no particular guarantee of ordering unless you go to some lengths to impose it.

The behavior is that a read will get a value that was there before the atomic op, or else after the atomic op. Which one is undefined unless you go to some length to impose ordering (both execution barriers as well as visibility guarantees). (**)
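As a minimal sketch of one way to impose that ordering, the simplest tool I know of is a kernel boundary: two kernels launched into the same stream run in order, and the second sees the writes of the first. The names here (add_kernel, read_kernel, ordered_read_sees_all_adds) and the sizes are made up for illustration, not taken from the question.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread atomically adds 1 to *x.
__global__ void add_kernel(int *x)
{
    atomicAdd(x, 1);
}

// Every thread records the value of *x that it sees.
__global__ void read_kernel(const int *x, int *seen)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    seen[tid] = *x;
}

// Hypothetical helper: returns true if every reader saw the fully updated value.
bool ordered_read_sees_all_adds()
{
    const int blocks = 4, threads = 64, n = blocks * threads;
    const int initial = 100;

    int *d_x = nullptr, *d_seen = nullptr;
    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_seen, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }
    cudaMemcpy(d_x, &initial, sizeof(int), cudaMemcpyHostToDevice);

    // Same (default) stream: read_kernel starts only after add_kernel is done,
    // so the kernel boundary acts as both execution barrier and visibility point.
    add_kernel<<<blocks, threads>>>(d_x);
    read_kernel<<<blocks, threads>>>(d_x, d_seen);

    std::vector<int> seen(n);
    cudaError_t err = cudaMemcpy(seen.data(), d_seen, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_seen);
    if (err != cudaSuccess) return false;

    // With ordering imposed, every reader must see all n increments.
    for (int v : seen) {
        if (v != initial + n) return false;
    }
    return true;
}
```

Calling ordered_read_sees_all_adds() from main should return true on a working GPU. Inside a single kernel you would need something else (for example __syncthreads() within a block), which this sketch does not cover.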

We should distinguish the above from the behavior when a single thread is doing all the activity, which doesn’t seem to be what you are asking about.

Let’s be specific. Suppose location x has the value 100. Let’s suppose that one thread (A) is doing an atomic add to x of 1, and another thread (B) is reading x. Let’s also assume there is no other activity of any kind with respect to x.

B should expect to read either 100 or 101. No other values are possible.
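A rough sketch of that 100/101 scenario, assuming a single int in global memory: thread 0 plays A, every other thread plays B. The kernel and helper names are hypothetical, and the volatile cast is my own addition so the compiler actually issues the load.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread 0 of the grid is A (atomic add); all other threads are B (plain read).
__global__ void race_kernel(int *x, int *seen)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 0) {
        seen[tid] = atomicAdd(x, 1);           // A: atomicAdd returns the old value
    } else {
        seen[tid] = *((volatile int *)x);      // B: unsynchronized read of x
    }
}

// Hypothetical helper: returns true if every B saw 100 or 101 and x ends at 101.
bool readers_see_only_old_or_new()
{
    const int blocks = 8, threads = 128, n = blocks * threads;
    const int initial = 100;

    int *d_x = nullptr, *d_seen = nullptr;
    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_seen, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }
    cudaMemcpy(d_x, &initial, sizeof(int), cudaMemcpyHostToDevice);

    race_kernel<<<blocks, threads>>>(d_x, d_seen);

    std::vector<int> seen(n);
    int final_x = 0;
    cudaError_t e1 = cudaMemcpy(seen.data(), d_seen, n * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(&final_x, d_x, sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_seen);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    if (seen[0] != initial) return false;       // A's atomicAdd saw the old value
    if (final_x != initial + 1) return false;   // exactly one increment landed
    for (int i = 1; i < n; ++i) {
        // Which of the two a given B sees is not determined; both are legal.
        if (seen[i] != initial && seen[i] != initial + 1) return false;
    }
    return true;
}
```

Which readers get 100 and which get 101 can change from run to run; the check only asserts that nothing else shows up.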

(**) Using “undefined” here strikes me as obvious. Two unsynchronized threads may or may not see each other’s writes. I don’t think this concept is unique or specific to CUDA. But if you need to refer to that as undefined, so be it.

Robert - thanks for your input, I see it the same way, with one caveat.
Is there a distinction to be made between UB and a race condition? If I understand your point of view correctly, you are saying that the problem I described would lead to a race condition.

However, can it be UB, that is, worse than a race condition?
Here’s what I mean; perhaps you can help me see things straight.

Let’s say that the atomic op is on a 64-bit-wide variable. Could it be that the first 32 bits are updated one clock cycle before the last 32 bits, while the read proceeds by first reading the last 32 bits and then the first 32 bits?

What you suggested in your post is that you can read a dog or a cat. I’m asking whether there can be a bad case where the value I’m getting is part dog, part cat.

Thanks for your input. Much appreciated.

In the general (non-atomic) case, I would certainly suggest that you avoid the scenario where different threads are writing to locations that overlap but are of different sizes. That is a very complex scenario to unpack.

However, in this case, I would trust the word atomic to mean exactly what it implies. For a 64-bit atomic, that is an uninterruptible read-modify-write operation on a properly aligned (naturally aligned) 64-bit quantity.

At the moment, if one of the operations is an atomic, I’m at a loss to explain how you might observe something other than a dog or a cat. I don’t think it is possible.
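If you want to probe the dog/cat question yourself, here is a sketch under my own assumptions: the start value 0x00000000FFFFFFFF is chosen so that adding 1 changes both 32-bit halves, which means a torn read would show up as a value that is neither the old nor the new one. The names are hypothetical, and the reader uses a volatile 64-bit load of a cudaMalloc'd (hence naturally aligned) location.

```cuda
#include <cuda_runtime.h>
#include <vector>

typedef unsigned long long u64;

// Thread 0 does the 64-bit atomic add; all other threads do a plain 64-bit read.
__global__ void race64_kernel(u64 *x, u64 *seen)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 0) {
        seen[tid] = atomicAdd(x, 1ULL);        // carry flips both 32-bit halves
    } else {
        seen[tid] = *((volatile u64 *)x);      // single aligned 64-bit load
    }
}

// Hypothetical helper: returns true if no reader saw a "part dog, part cat" value.
bool no_torn_64bit_reads()
{
    const int blocks = 8, threads = 128, n = blocks * threads;
    const u64 dog = 0x00000000FFFFFFFFULL;     // value before the atomic add
    const u64 cat = 0x0000000100000000ULL;     // value after the atomic add

    u64 *d_x = nullptr, *d_seen = nullptr;
    if (cudaMalloc(&d_x, sizeof(u64)) != cudaSuccess) return false;
    if (cudaMalloc(&d_seen, n * sizeof(u64)) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }
    cudaMemcpy(d_x, &dog, sizeof(u64), cudaMemcpyHostToDevice);

    race64_kernel<<<blocks, threads>>>(d_x, d_seen);

    std::vector<u64> seen(n);
    cudaError_t err = cudaMemcpy(seen.data(), d_seen, n * sizeof(u64),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_seen);
    if (err != cudaSuccess) return false;

    for (u64 v : seen) {
        // A torn read would look like 0x0 or 0x00000001FFFFFFFF.
        if (v != dog && v != cat) return false;
    }
    return true;
}
```

A passing run is not a proof, of course; it only fails to find a counterexample. It is consistent with the expectation above that only a dog or a cat can be observed.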