Is a storing a double atomic?

Hi! I am writing the following kernel to implement the Bellman-Ford routing algorithm. The weight of the edges needs to be in double precision. The edges are represented using d_from for the starting node, d_to for the destination node and d_edge_weigth fot the weigth of the edge. The vector d_node_weight is initialize to DBL_MAX for all node except the starting node which is set to 0. The kernel updates the nodes’s weight using an iterative process. Once no more changes are required, d_node_weight contains the sum of the weights to go from the starting node to node i.

My question is: when writing to d_node_weight[to_node]

d_node_weight[to_node] = from_weigth_node + weigth;

is the storing operation done atomically i.e. is it guaranteed that all 64 bits forming the double will come from the same thread. I know that many threads could be writing to the same d_node_weight[to_node], but I want to make sure that I will never get, lets say the first 32 bits coming from one thread when the other 32 bits come from another thread.

Thank you for you help.

__global__ void bellman_kernel(
    int *d_from,            ///< [in]  Pointer to the from nodes
    int *d_to,              ///< [in]  Pointer to the to nodes
    double *d_edge_weigth,  ///< [in]  Pointer to the weigth vector
    int nedges,             ///< [in]  Number of elements in the d_from, d_to and d_weigth vectore
    double*d_node_weight,   ///< [in/out] Pointer to the weigth of the nodes
    bool *d_modified_flag)  ///< [out] Pointer to the flag, set to true if a change was made, set to false if no change was made
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int grid_size = gridDim.x * blockDim.x;

    for (int i = tid; i < nedges; i = i + grid_size)
    {
        int from_node = d_from[i];
        int to_node = d_to[i];
        real weigth = d_edge_weigth[i];
        real from_weigth_node = d_node_weight[from_node];
        if (from_weigth_node + weigth < d_node_weight[to_node]) {
            d_node_weight[to_node] = from_weigth_node + weigth;
            *d_modified_flag = true;
        }
    }
}

yes, in the respect you describe, it is “atomic”.

Assuming all threads that are writing to it are using the same width and using the same built-in datatype or nvidia-provided vector type of up to 16 bytes, anything stored there, observed at any time, will be a coherent value that emanated from one and only one thread.

Thank you txbob for your comment. I found the following about the C++ language which says that storing a double is not atomic and that one should use the function mint_store_64_relaxed(&sharedValue, value); to store it atomically. This makes me wonder about CUDA. Would you have a reference that supports your explanation, it would make me feel much more comfortable, especially that you mentioned the NVIDIA-provided vector types of up to 16 bytes. These could be very handy in my application.
Best regards,

I think the general case is implementation dependent. I think in general for fairly modern x86 or x64 systems its reasonable to expect atomicity in certain cases (with appropriate alignment, etc.):
https://stackoverflow.com/questions/1292786/is-updating-double-operation-atomic

With respect to CUDA C++, the implementation is “mostly” spelled out here:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

You should also take into account additional info in the arch-specific hardware implementation:

For example, for cc 3.0:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0

Please read these references in their entirety, as the first contains discussion of support of writes of up to 16 bytes by a single thread. I will refer to some specific excerpts below.

Let’s first consider the case where writes are emanating from separate warps. This (in my view) is entirely covered in the first programming guide (PG) reference:

“Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions.”

and

“When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.”

Since the writes from separate warps are defined to be accomplished in separate transactions to memory (or a cacheline, if you prefer) a “transaction” can only contain data from one thread for a specific location (because that was our governing assumption for this first part of the treatment). Therefore the accesses are serialized and independent, because they are carried out in separate transactions. Updates to the same location will be carried out in separate, independent transactions. The data in each transaction, for a given location is “coherent” by definition and our stipulation for this part of the treatment (one thread per warp per location).

To consider the case where multiple requests to the same location emanate from the same warp (and therefore must be resolved in a single transaction) we must factor in information from the 2nd PG reference:

“If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.”

Therefore, the case of multiple accesses to the same location by the same warp will be collapsed into a single update to that location, from a single thread. Which thread (in the warp) “wins” is undefined. This “collapsing” to a single update allows this case to dovetail into our previous treatment/stipulation.

That’s all I have. I don’t think I would be able to respond to further requests for clarification.

Hi txbob! This is an excellent explanation. It all makes sense now. Thank you very much.
Regards,

In addition to the references given by txbob, the PTX manual now has a precise, formal definition of the memory consistency model of CUDA:
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model
Note this is way more detailed than most memory model descriptions in “conventional” ISA documentations.

For a higher-level overview and discussion on how it relates to C++ memory semantics, this presentation from Olivier Giroux of Nvidia at CppCon should also be a good start:
https://www.youtube.com/watch?v=86seb-iZCnI