NaN

gpugpu · February 28, 2009, 1:16am

I have a function in my kernel in which all threads of the thread block add a float value to a single location in shared memory. That shared-memory value is then added to a single location in global memory. Can this cause a NaN value? Should I use atomicAdd()?

SPWorley · February 28, 2009, 7:47am

Two topics in your question:

If you add a NaN to any other value, the result is NaN. So the “NaN-ness” is sticky, and any NaNs in your source data will likely turn your results into NaN.
If you’re just looking to sum many values from different threads, that can be done with atomics, but please study the “reduction” example in the SDK which shows the most efficient way. Reduction is a common pattern in CUDA, it’s important to understand it well.
In your specific case, it sounds like you’re only adding one value per block to a global accumulator, and for that atomics might indeed be most efficient since you’re likely adding a few hundred values or whatever. But if you have 500 threads adding their results to a single shared memory location, that’s likely best done with a reduction.
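
A minimal sketch of the pattern described above: each block reduces its values in shared memory, then one thread per block adds the block's partial sum to a global accumulator with an atomic. The names `block_sum_atomic` and `sum_atomic` are made up for illustration. It assumes a power-of-two block size and hardware that supports `atomicAdd()` on `float` (compute capability 2.0 or later), so it would not run on older cards.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: per-block tree reduction in shared memory, followed by
// a single atomicAdd per block into a global accumulator.
// Assumes blockDim.x is a power of two and float atomicAdd is available.
__global__ void block_sum_atomic(const float* in, int n, float* total)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element; out-of-range threads contribute zero.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Only one add per block reaches global memory.
    if (tid == 0) atomicAdd(total, sdata[0]);
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and returns the accumulated sum.
float sum_atomic(const float* h_in, int n)
{
    assert(n > 0);
    const int threads = 256;  // power of two, as the kernel assumes
    const int blocks = (n + threads - 1) / threads;

    float* d_in = 0;
    float* d_total = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));  // all-zero bits is 0.0f

    block_sum_atomic<<<blocks, threads, threads * sizeof(float)>>>(d_in, n, d_total);

    float h_total = 0.0f;
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return h_total;
}
```

If any input element is NaN, it propagates through the block's partial sum and into the accumulator, which is the "sticky" behaviour described above.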

gpugpu · March 1, 2009, 1:06am

So, my main question here is: if multiple threads add to a single memory location simultaneously, is that guarded by the hardware, or will it cause invalid values to be stored in that location?

SPWorley · March 1, 2009, 2:02am

Yes. An atomic add of a NaN to a memory location should result in a NaN.

Of course I haven’t tested it…

MisterAnderson42 · March 1, 2009, 2:11am

If you write to the same memory location with += by all threads then you are guaranteed to get garbage results from the race condition.

The hardware only performs adds (or other operations) atomically when you explicitly use the atomic* functions to do so.
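
To make the difference concrete, here is a small sketch that counts threads two ways: with a plain `+=` and with `atomicAdd()`. The kernel and helper names are hypothetical. It uses an `int` counter in global memory, which assumes compute capability 1.1 or later for the atomic version.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Unsafe: every thread does an unsynchronized read-modify-write on *counter,
// so updates from different threads can overwrite each other.
__global__ void racy_increment(int* counter)
{
    *counter += 1;
}

// Safe: the read-modify-write is performed as one indivisible operation.
__global__ void atomic_increment(int* counter)
{
    atomicAdd(counter, 1);
}

// Hypothetical host helper: launches one of the two kernels and returns
// the final counter value.
int count_threads(bool use_atomic, int blocks, int threads)
{
    assert(blocks > 0 && threads > 0);

    int* d_counter = 0;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    if (use_atomic)
        atomic_increment<<<blocks, threads>>>(d_counter);
    else
        racy_increment<<<blocks, threads>>>(d_counter);

    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

Calling `count_threads(true, 64, 256)` should return the full thread count of 16384. The racy variant has no defined result; in practice it tends to come out far lower, because most increments are lost.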

gpugpu · March 1, 2009, 2:57am

Thanks MisterAnderson42. So, reduction is not possible in global memory if the hardware does not support atomicAdd?
Because my ultimate goal is to sum each single value from each thread block to a single location in global memory.

I have Quadro FX 5600 and I guess it is compute 1.0 which doesn’t support atomicAdd. Is that right?

Thanks.

Mr_Nuke · March 1, 2009, 4:00am

Yes, that’s correct. The FX 5600 is compute capability 1.0, which doesn’t support atomic instructions.

SPWorley · March 1, 2009, 10:07am

You can do global reduction even on 1.0 hardware just by using kernel launches as barriers. You probably only have 100-10000 values to add (you said it’s one value from each block) so just launch a new kernel to do the reduction on those. Kernel launches are cheap, only 15us overhead or so, so even though you “waste” a lot of MPs which are idle, it’s not really a big deal since the final reduction step is small. Most of your effort will be in the initial reduction where you’re efficiently using all your bandwidth doing the initial per-block reduction.

Depending on your application, it may be useful to skip the last reduction pass and just do it on the CPU, but often you want the result on the GPU so you may as well do the compute there.

So in summary:

First kernel: Lots of blocks, say 500. Each one does its own reduction on a chunk of data, and writes the final per-block value to global memory. You now have 500 values written in device memory.
Second kernel. Just one block! It reads 500 values into shared memory, does a reduction, and writes out one single number (your answer).
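
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK reduction sample: the names `reduce_sum` and `sum_two_pass` are made up, the block size is assumed to be a power of two, and the same kernel is reused for both passes. No atomics are involved, so the approach itself does not depend on atomic support.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums its share of in[0..n) and writes one
// partial result to out[blockIdx.x]. Assumes blockDim.x is a power of two.
__global__ void reduce_sum(const float* in, int n, float* out)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Strided loop, so a single block can also cover more elements than threads.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: first launch produces one partial sum per block,
// second launch (one block) reduces the partial sums to a single value.
float sum_two_pass(const float* h_in, int n)
{
    assert(n > 0);
    const int threads = 256;                 // power of two
    int blocks = (n + threads - 1) / threads;
    if (blocks > 500) blocks = 500;          // "lots of blocks, say 500"

    float* d_in = 0;
    float* d_partial = 0;
    float* d_result = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernels launched on the same stream run in order, so the launch
    // boundary acts as the grid-wide barrier between the two passes.
    reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, n, d_partial);
    reduce_sum<<<1, threads, threads * sizeof(float)>>>(d_partial, blocks, d_result);

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    return h_result;
}
```

Skipping the second launch and summing the copied-back partial results on the CPU would be the other option mentioned above.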

There’s a mental unpleasantness about that second kernel (“What a waste, most of your GPU is idle!”) but in practice it’s really a small overhead, negligible compared to kernel #1 if that kernel has enough work to keep itself busy.

_Big_Mac · March 1, 2009, 2:09pm

IIRC the Programming Guide states that in a race situation, one thread is guaranteed to perform a successful write, but it’s undefined which one. So it’ll be garbage from the algorithmic point of view, but it shouldn’t be the “random bit” kind of garbage that might result from clashing writes.

MisterAnderson42 · March 2, 2009, 12:52pm

Yes. This is exactly what one should do. HOOMD has several kernels like this. You also need this technique if you are reducing floating point variables.

Your clarification is entirely correct. I’ve typically found that with intentional += race conditions, the resulting value always ends up much smaller than it should be, which confirms this.
