Memory writes to the same location doc ver 0.8 vs doc ver 0.8.1


In CUDA Programming Guide 0.8, it was stated that:

But, in the new version (0.8.1), this was replaced by:

I am confused (and frankly very concerned) about this change. I am wondering if someone from NVIDIA could shed some light on it! In my code, I rely heavily on more than one thread writing to the same location in global memory, and it works the way I want. In my case, when more than one thread writes to the same memory location, they all write the same value. The problem is that I do not know in advance which thread will write it. So I let all eligible threads write the value and, based on the old documentation, I was guaranteed that at least one of the writes would succeed.

Specifically, I have the following questions:

  1. What is the meaning of “one or more of the threads” in the new doc? How can that be applied to one thread only?

  2. Does this change in the documentation mean that NVIDIA will stop supporting this feature, or did they find it malfunctioning? If it is the latter, is there any plan to make this work in subsequent releases?

  3. Do you have any recommendations for working around this problem in the situation explained above (when all threads write the same value)?

Thanks in advance!


First, I am not a native English speaker, but to me the new formulation expresses just the same as the old one.

Second, you should never write to the same memory location (at the same time) with more than one thread. See other messages on this board by Mark Harris explaining how to do it without causing threads to diverge.

Third, I don’t understand what you are saying here

If they all write the same value, why does it matter which one writes it? Just pick thread 0 to write it.


Thanks Peter for your reply!

I think there is a big difference between the two versions. If just the order of writes is undefined, then the final value will be one of the written values but we just do not know which one it is. But, if the result is undefined, then the result could be any garbage value not related to any of the actually written values.

In my application, each thread in a block reads a data element and, while processing that element, checks whether a specific condition is true or false. What I want to know at the end of the kernel is whether the block contains any element for which the condition is true. I do that with a flag in global memory that is initialized to false. If any thread finds the condition true in the element it processes, it sets the flag to true. So I cannot know in advance which thread will set the flag, or how many of them will do so.
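The pattern described above can be sketched roughly as follows. This is only an illustration written against the current CUDA runtime API, not the 0.8 toolkit discussed in this thread. The condition (`value < 0`) and the names `conditionHolds`, `markBlocks`, and `computeBlockFlags` are hypothetical, chosen for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical condition for illustration: the element is negative.
__device__ bool conditionHolds(int value) { return value < 0; }

// Each thread checks one element. Every thread whose element satisfies the
// condition stores the same value (1) into the flag of its own block, so
// several threads may write the same address with the same value.
__global__ void markBlocks(const int* data, int n, int* blockFlags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && conditionHolds(data[i])) {
        blockFlags[blockIdx.x] = 1;
    }
}

// Host helper (hypothetical): returns one flag (0 or 1) per block.
std::vector<int> computeBlockFlags(const std::vector<int>& host, int threadsPerBlock)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return {};
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int* dData = nullptr;
    int* dFlags = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dFlags, blocks * sizeof(int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dFlags, 0, blocks * sizeof(int));  // flags start as "false"

    markBlocks<<<blocks, threadsPerBlock>>>(dData, n, dFlags);

    std::vector<int> flags(blocks);
    // This copy waits for the kernel to finish before reading the flags.
    cudaMemcpy(flags.data(), dFlags, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dFlags);
    return flags;
}
```

The flags are read only after the kernel has completed, which matches the usage described here: the value is never checked by the kernel that writes it.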

In the case above, I could solve it by using an array of flags in shared memory: each thread sets only its own flag in this array, a scan operation at the end determines whether any of them was set, and then thread 0, for example, writes the global memory flag. But that adds overhead, and I am trying hard to reduce the running time, not increase it. Also, sometimes I need a single flag for the entire data stream. In that case, any thread of any block can set the same flag, and I do not know in advance which thread or how many threads will do so. Performing a scan operation in this case would significantly degrade performance.


With respect to ordering, the manual is pretty clear that you can never predict the order of execution of blocks, or of threads within a block. The only synchronization mechanism is making threads wait.

Doing the shared memory scan as you described is much faster than accessing device memory. Shared memory runs at register speed; nothing is faster. Together with __syncthreads(), it is the only way to collect block-wide state in single-pass processing.
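A minimal sketch of that shared memory approach, assuming a fixed power-of-two block size and the same hypothetical condition (`data[i] < 0`). The names `kThreads`, `blockAny`, and `blockAnyFlags` are made up for this example, and it uses a plain OR tree reduction rather than a full scan.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

__global__ void blockAny(const int* data, int n, int* blockFlags)
{
    __shared__ int flags[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread writes only its own slot, so no two threads share an address.
    flags[tid] = (i < n && data[i] < 0) ? 1 : 0;  // hypothetical condition
    __syncthreads();

    // Tree reduction with OR: halve the number of active slots each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            flags[tid] |= flags[tid + stride];
        }
        __syncthreads();
    }

    // A single thread writes the block's result to global memory.
    if (tid == 0) {
        blockFlags[blockIdx.x] = flags[0];
    }
}

// Host helper (hypothetical): returns one flag (0 or 1) per block.
std::vector<int> blockAnyFlags(const std::vector<int>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return {};
    int blocks = (n + kThreads - 1) / kThreads;

    int* dData = nullptr;
    int* dFlags = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dFlags, blocks * sizeof(int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockAny<<<blocks, kThreads>>>(dData, n, dFlags);

    std::vector<int> flags(blocks);
    cudaMemcpy(flags.data(), dFlags, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dFlags);
    return flags;
}
```

Because thread 0 always writes the block's flag, the flag array does not need to be cleared before the launch.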

If you need to update a flag for the entire grid, there is no way to do that in a single pass. As you said, make thread 0 write the flag for its block to device memory. Then run a second pass with one block that has as many threads as there were blocks before, and collect the global flag status. If the result of this operation is relevant for CPU processing rather than GPU processing, I would download the flag field instead and do the final flag merge on the CPU.
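Both options for the second pass might look something like this. It is a sketch under assumptions: the per-block flags fit into a single block (at most 1024 threads is assumed here, which depends on the device), and the names `mergeFlags`, `mergeOnDevice`, and `mergeOnHost` are hypothetical. The device helper uploads the flags only to keep the example self-contained; in a real pipeline they would already be in device memory from the first pass.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Second pass: one block, one thread per flag written by the first pass.
__global__ void mergeFlags(const int* blockFlags, int numFlags, int* result)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;

    // Threads beyond numFlags contribute 0 so the OR is unaffected.
    s[tid] = (tid < numFlags) ? blockFlags[tid] : 0;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s[tid] |= s[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        *result = s[0];
    }
}

// Option 1 (hypothetical helper): merge on the GPU. Returns -1 if the flags
// do not fit into one block under the assumed limit.
int mergeOnDevice(const std::vector<int>& blockFlags)
{
    int n = static_cast<int>(blockFlags.size());
    if (n == 0) return 0;
    if (n > 1024) return -1;  // assumed per-block thread limit

    int threads = 1;  // round up to a power of two for the tree reduction
    while (threads < n) threads <<= 1;

    int* dFlags = nullptr;
    int* dResult = nullptr;
    cudaMalloc(&dFlags, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dFlags, blockFlags.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    mergeFlags<<<1, threads, threads * sizeof(int)>>>(dFlags, n, dResult);

    int result = 0;
    cudaMemcpy(&result, dResult, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dFlags);
    cudaFree(dResult);
    return result;
}

// Option 2 (hypothetical helper): download the flag field and merge on the CPU.
int mergeOnHost(const std::vector<int>& blockFlags)
{
    int result = 0;
    for (int f : blockFlags) {
        result |= f;
    }
    return result;
}
```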


Actually, Mark Harris posted a hack to calculate histograms that relies on the fact that only one of multiple writes to the same address will succeed. But he used it for writing to shared memory. I do not know how it will work for global memory. Here is a link to this thread:…16&hl=histogram


Yeah, this technique does not work with device memory, as there is latency for both reads and writes. So you cannot check in the next cycle whether the write in the immediately preceding statement succeeded.


Yeah I see what you are saying. However, I do not need to do that in my case. I just need one of the writes to get through. The value is checked in another kernel.

Right. As I said above, when you do two passes you are fine.


It won’t be any garbage. The right formulation is: If the instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, how many writes occur to that location and the order in which they occur is undefined, but one of the writes is guaranteed to succeed.

We’ll fix the programming guide to make that clearer.
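One way to observe the guarantee stated above is a small experiment in which every thread of a single warp stores a different value to the same address. This is a sketch against the current CUDA runtime API; the names `conflictingWrites` and `runConflictingWrites` are hypothetical. Which value ends up in memory is undefined, but it should be one of the values that was written.

```cuda
#include <cuda_runtime.h>

// All threads of the warp store their own index to the same location.
__global__ void conflictingWrites(int* out)
{
    *out = threadIdx.x;
}

// Host helper (hypothetical): launches one warp of 32 threads and returns
// whatever value ended up in memory.
int runConflictingWrites()
{
    int* dOut = nullptr;
    cudaMalloc(&dOut, sizeof(int));

    int initial = -1;  // sentinel that no thread writes
    cudaMemcpy(dOut, &initial, sizeof(int), cudaMemcpyHostToDevice);

    conflictingWrites<<<1, 32>>>(dOut);

    int result = -1;
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return result;
}
```

If the formulation holds, the returned value lies between 0 and 31 and is never the sentinel. Which index wins may differ between devices and runs.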



Thanks a lot for the clarification!

I have one more question. Does writing from multiple threads to the same memory address simultaneously induce extra latency?



Yes, if multiple threads actually write to device memory. How many of them do is machine-dependent, though. That’s what I meant by “how many writes occur is undefined” in my previous answer.