Memory writes to the same location doc ver 0.8 vs doc ver 0.8.1

mehussein · April 19, 2007, 11:22pm

Hi,

In CUDA Programming Guide 0.8, it was stated that:

But, in the new version (0.8.1), this was replaced by:

I am confused (and frankly very concerned) about this change, and I am wondering if someone from NVIDIA could shed some light on it. In my code, I rely heavily on more than one thread writing to the same location in global memory, and it works as I want it to. In my case, when more than one thread writes to the same memory location, they all write the same value. The problem is that I do not know in advance which one will write it. So I let all eligible threads write the value and, based on the old documentation, I was guaranteed that at least one of them would succeed.

Specifically, I have the following questions:

What is the meaning of “one or more of the threads” in the new doc? How can that apply to one thread only?
Does this change in the documentation mean that NVIDIA will stop supporting this feature? Or did they find it malfunctioning? If it is the latter, is there any plan to make this work in subsequent releases?
Do you have any recommendations for working around this problem in the situation explained above (when all threads write the same value)?

Thanks in advance!

MH

prkipfer · April 20, 2007, 9:43am

First, I am not a native English speaker, but to me the new formulation expresses just the same as the old one.

Second, you should never write to the same memory location (at the same time) with more than one thread. See other messages on this board by Mark Harris explaining how to do it without causing threads to diverge.

Third, I don’t understand what you are saying here

If they all write the same value, why does it matter which one writes it? Just pick thread 0 to write it.

Peter

mehussein · April 20, 2007, 1:20pm

Thanks Peter for your reply!

I think there is a big difference between the two versions. If just the order of writes is undefined, then the final value will be one of the written values but we just do not know which one it is. But, if the result is undefined, then the result could be any garbage value not related to any of the actually written values.

In my application, each thread in a block reads a data element and, while processing that element, checks whether a specific condition is true or false. What I want to know at the end of the kernel is whether the block contains any element for which the condition is true. I do that with a flag in global memory that is initialized to false. If any thread finds the condition true for the element it processes, it sets the flag to true. So I cannot know in advance which thread will set the flag or how many of them will do so.
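The pattern described here can be sketched roughly as follows. This is only an illustration, not the poster's actual code: `condition_holds`, `set_flag_if_any`, and `any_condition` are hypothetical names, and the "negative value" test is a made-up stand-in for the real condition. Every eligible thread stores the same value to the same global flag, and the flag is only read after the kernel has finished.

```cuda
#include <cuda_runtime.h>

// Hypothetical predicate standing in for "the condition" in the post.
__device__ bool condition_holds(int x) { return x < 0; }

// Every thread whose element satisfies the condition stores the same value (1)
// to the same global flag. No thread reads the flag inside this kernel.
__global__ void set_flag_if_any(const int* data, int n, int* flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && condition_holds(data[i])) {
        *flag = 1;  // colliding writes, but all of them carry the same value
    }
}

// Host helper (hypothetical): returns the flag after the kernel completes.
// Assumes n > 0; error checking is omitted to keep the sketch short.
int any_condition(const int* host_data, int n)
{
    int* d_data = nullptr;
    int* d_flag = nullptr;
    int h_flag = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_data, host_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_flag, 0, sizeof(int));  // flag starts out as "false"

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    set_flag_if_any<<<blocks, threads>>>(d_data, n, d_flag);

    // This blocking copy on the default stream runs after the kernel is done.
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_flag);
    return h_flag;
}
```

Calling `any_condition(data, n)` from host code yields the flag value. The sketch relies on at least one of the colliding writes succeeding, which is exactly the guarantee being asked about in this thread.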

In the case above, I could solve it by using an array of flags in shared memory: each thread sets only its own flag in this array, a scan operation at the end determines whether any of them set its flag, and then one thread (thread 0, for example) writes the global memory flag. But that adds overhead, and I am trying hard to reduce the running time, not increase it. Also, sometimes I need a single flag for the entire data stream. In this case, any thread of any block can set the same flag, and I do not know in advance which one or how many will do so. Performing a scan operation in this case would significantly degrade performance.

-MH

prkipfer · April 20, 2007, 1:30pm

With regard to the ordering, the manual is pretty clear that you can never predict the order of execution of blocks or of threads within a block. The only sync mechanism is making threads wait.

Doing the shared mem scan as you described is much faster than device mem access. Shared mem is register speed. Nothing is faster. Together with __syncthreads(), it is the only way to collect block-wide state in single-pass processing.

If you need to update a flag for the entire grid, there is no way of doing that in a single pass. As you said, make thread 0 write the flag for its block to device mem. Then run a second pass with one block that has as many threads as there were blocks before and collect the global flag status. If the result of this operation is relevant for CPU processing rather than GPU, I would download the flag field instead and do the final flag merge on the CPU.
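A minimal sketch of this two-pass scheme, under some assumptions that are not in the thread: the kernel and helper names are hypothetical, the predicate is a made-up "negative value" test, the block size is fixed at 256 threads (a power of two, which the reduction loop needs), and the second pass uses a strided loop so that it also works when there are more per-block flags than threads in one block.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Hypothetical predicate standing in for "the condition" in the thread.
__device__ bool condition_holds(int x) { return x < 0; }

// OR-reduce kThreads flags held in shared memory; the result ends up in s[0].
__device__ void block_or(int* s, int tid)
{
    __syncthreads();
    for (int stride = kThreads / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] |= s[tid + stride];
        __syncthreads();
    }
}

// Pass 1: each thread sets its own shared flag; thread 0 writes one flag per block.
__global__ void block_flags(const int* data, int n, int* flags_per_block)
{
    __shared__ int s_flag[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s_flag[tid] = (i < n && condition_holds(data[i])) ? 1 : 0;
    block_or(s_flag, tid);
    if (tid == 0) flags_per_block[blockIdx.x] = s_flag[0];
}

// Pass 2: a single block merges the per-block flags into one grid-wide flag.
__global__ void merge_flags(const int* flags_per_block, int num_flags, int* result)
{
    __shared__ int s_flag[kThreads];
    int tid = threadIdx.x;
    int acc = 0;
    for (int i = tid; i < num_flags; i += kThreads) acc |= flags_per_block[i];
    s_flag[tid] = acc;
    block_or(s_flag, tid);
    if (tid == 0) *result = s_flag[0];
}

// Host helper (hypothetical). Assumes n > 0; error checking omitted for brevity.
int any_condition_two_pass(const int* host_data, int n)
{
    const int blocks = (n + kThreads - 1) / kThreads;
    int *d_data = nullptr, *d_flags = nullptr, *d_result = nullptr;
    int h_result = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_flags, blocks * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, host_data, n * sizeof(int), cudaMemcpyHostToDevice);

    block_flags<<<blocks, kThreads>>>(d_data, n, d_flags);
    merge_flags<<<1, kThreads>>>(d_flags, blocks, d_result);  // same stream, so it runs after pass 1

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_flags);
    cudaFree(d_result);
    return h_result;
}
```

In this version no two threads ever write to the same address, so it does not depend on how colliding writes are resolved. The cost is the extra reduction work and the second kernel launch. The CPU-side merge mentioned above would replace `merge_flags` with a copy of `d_flags` back to the host and a loop there.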

Peter

mehussein · April 20, 2007, 2:53pm

Actually, Mark Harris posted a hack to calculate histograms that relied on the fact that only one of multiple writes to the same address will succeed. But he used it for writing to shared memory. I do not know how it will work for global memory. Here is a link to this thread:

The Official NVIDIA Forums | NVIDIA

-MH

prkipfer · April 20, 2007, 3:12pm

Yeah, this technique does not work with device memory, as there is latency for both reads and writes. So you cannot check in the next cycle whether the write in the immediately preceding statement succeeded.

Peter

mehussein · April 20, 2007, 3:50pm

Yeah I see what you are saying. However, I do not need to do that in my case. I just need one of the writes to get through. The value is checked in another kernel.

prkipfer · April 20, 2007, 4:14pm

Right. As I said above, when you do two passes, you are fine.

Peter

Cyril_Zeller · April 21, 2007, 5:55am

It won’t be any garbage. The right formulation is: If the instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, how many writes occur to that location and the order in which they occur is undefined, but one of the writes is guaranteed to succeed.

We’ll fix the programming guide to make that clearer.
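To illustrate the formulation above, here is a small sketch in which the 32 threads of one warp each store a different value to the same global location. The names `write_own_id` and `colliding_write_result` are hypothetical. Which thread's value ends up in memory is not specified, but under the stated guarantee it should be one of the values that was actually written, not garbage.

```cuda
#include <cuda_runtime.h>

// All threads of the block (one warp here) write their own index to *out.
__global__ void write_own_id(int* out)
{
    *out = threadIdx.x;
}

// Host helper (hypothetical): returns whatever value survived the collision.
// Error checking is omitted to keep the sketch short.
int colliding_write_result()
{
    int* d_out = nullptr;
    int h_out = -1;  // sentinel: none of the threads writes -1
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_out, &h_out, sizeof(int), cudaMemcpyHostToDevice);

    write_own_id<<<1, 32>>>(d_out);  // one block of 32 threads, i.e. one warp

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The returned value should lie in the range 0 to 31, and code should not depend on which one it is. When all threads write the same value, as in the flag use case discussed earlier in the thread, the outcome is the same whichever write succeeds.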

Thanks,

Cyril

mehussein · April 22, 2007, 7:09pm

Thanks a lot for the clarification!

I have one more question. Does writing from multiple threads to the same memory address simultaneously induce extra latency?

Thanks!

-MH

Cyril_Zeller · May 4, 2007, 3:56pm

Yes, if multiple threads actually write to device memory. How many of them do is machine-dependent, though. That’s what I meant by “how many writes occur is undefined” in my previous answer.

Cyril
