I have a function in my kernel in which all the threads of a thread block add a float value to a single memory location in shared memory. Then that single shared-memory location is added to a single location in global memory. Can this cause a NaN value? Should I use atomicAdd()?
Two topics in your question:
- If you add a NaN to any other value, the result is NaN. So the “NaN-ness” is sticky, and any NaNs in your source data will likely propagate so that your results end up NaN.
- If you’re just looking to sum many values from different threads, that can be done with atomics, but please study the “reduction” example in the SDK, which shows the most efficient way. Reduction is a common pattern in CUDA, and it’s important to understand it well.
In your specific case, it sounds like you’re only adding one value per block to a global accumulator, and for that atomics might indeed be most efficient, since you’re likely only adding a few hundred values. But if you have 500 threads adding their results to a single shared memory location, that’s best done with a reduction (see the sketch below).
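A minimal sketch of such a per-block shared-memory reduction (the kernel name and BLOCK_SIZE are illustrative, not taken from the SDK sample), assuming the block size is a power of two:

```cpp
#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE inputs; thread 0 writes one partial sum per block.
// Note: any NaN among the inputs will propagate into the partial sum.
__global__ void blockReduce(const float *g_in, float *g_partial, int n)
{
    __shared__ float s_data[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();
    }

    // One value per block goes to global memory.
    if (threadIdx.x == 0)
        g_partial[blockIdx.x] = s_data[0];
}
```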
So, my main question here is: if multiple threads add to a single memory location simultaneously is that guarded by the hardware OR that will cause invalid values to be stored in that location?
Yes. An atomic add of a NaN to a memory location should result in a NaN.
Of course I haven’t tested it…
If all threads write to the same memory location with +=, you are guaranteed to get garbage results from the race condition.
The hardware only performs adds (or other operations) atomically when you explicitly use the atomic* functions to do so.
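To make the contrast concrete, here is a hedged sketch (illustrative names, not from any sample). Note that float atomicAdd on global memory only exists on compute capability 2.0 and later hardware:

```cpp
// WRONG: every thread does a plain read-modify-write on the same address,
// so updates from different threads can overwrite each other.
__global__ void racySum(const float *g_in, float *g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        *g_out += g_in[i];   // race condition: result is effectively garbage
}

// Correct on hardware that supports float atomics: the read-modify-write is
// performed atomically, so no update is lost. The order of the additions
// (and thus the floating-point rounding) is still nondeterministic, and an
// atomic add of a NaN will still make the accumulator NaN.
__global__ void atomicSum(const float *g_in, float *g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(g_out, g_in[i]);
}
```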
Thanks, MisterAnderson42. So, reduction is not possible in global memory if the hardware does not support atomicAdd?
Because my ultimate goal is to sum the single value from each thread block into a single location in global memory.
I have a Quadro FX 5600, and I guess it is compute 1.0, which doesn’t support atomicAdd. Is that right?
Thanks.
Yes, that’s correct. The FX 5600 is compute capability 1.0, which doesn’t support atomic instructions.
You can do a global reduction even on 1.0 hardware just by using kernel launches as barriers. You probably only have 100-10000 values to add (you said it’s one value from each block), so just launch a new kernel to do the final reduction on those. Kernel launches are cheap, only 15 us of overhead or so, so even though you “waste” a lot of MPs which sit idle, it’s not really a big deal since the final reduction step is small. Most of your effort will be in the first pass, where you’re efficiently using all your bandwidth doing the per-block reduction.
Depending on your application, it may be useful to skip the last reduction pass and just do it on the CPU, but often you want the result on the GPU so you may as well do the compute there.
So in summary:
- First kernel: lots of blocks, say 500. Each one does its own reduction on a chunk of data and writes its final per-block value to global memory. You now have 500 values written in device memory.
- Second kernel: just one block! It reads the 500 values into shared memory, does a reduction, and writes out one single number (your answer).
There’s a mental unpleasantness about that second kernel (“What a waste, most of your GPU is idle!”), but in practice it’s really a small overhead indeed, negligible compared to kernel #1 if that one has enough work to keep itself busy.
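A minimal sketch of that second kernel, assuming the partial sums were written by a first-pass kernel like the blockReduce sketch above and that their count fits in one block (500 values in a 512-thread block, the per-block limit on compute 1.x hardware). The names here are illustrative:

```cpp
// Launched with a single block; shared memory sized to blockDim.x floats.
__global__ void finalReduce(const float *g_partial, float *g_result, int numBlocks)
{
    extern __shared__ float s_data[];

    // Each thread loads one partial sum (or 0 if past the end).
    s_data[threadIdx.x] = (threadIdx.x < numBlocks) ? g_partial[threadIdx.x] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *g_result = s_data[0];
}
```

Host side it is just two launches, with the kernel boundary acting as the barrier, no atomics required:

```cpp
blockReduce<<<numBlocks, BLOCK_SIZE>>>(d_in, d_partial, n);
finalReduce<<<1, 512, 512 * sizeof(float)>>>(d_partial, d_result, numBlocks);
```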
IIRC the Programming Guide states that in a race situation, one thread is guaranteed to perform a successful write, but it’s undefined which one. So the result will be garbage from the algorithmic point of view, but it shouldn’t be the “random bits” kind of garbage that might result from clashing writes.
Yes. This is exactly what one should do. HOOMD has several kernels like this. You also need this technique if you are reducing floating point variables.
Your clarification is entirely correct. I’ve found that with intentional += race conditions, the resulting value typically ends up much smaller than it should be, which confirms this.