Current best practices for controlling global memory updates


I am relatively new to CUDA and I hope this question isn’t inappropriate.

I am curious whether an established best practice exists for limiting how often a value in global memory gets updated. For example, if a global variable is initially set to zero, one might imagine a kernel reducing the number of updates with something like…

if(global_var[i] == 0 && new_value != 0)
global_var[i] = new_value;

And if this check were hit many times, one could use a shared memory copy to reduce how often the global value is read or written…

if(i < n)
shared_copy[tx] = global_var[i];

… then later use the shared value for flow control…

if(shared_copy[tx] == 0 && new_value != 0)
global_var[i] = new_value;

My question is: after testing code similar to this it is clear that this flow control method doesn’t work - but do other methods exist?

Your question is slightly confusing.

Are you asking whether a bunch of warps addressing the same global variable at the same time will run into a conflict, and how to resolve that?

Or are you asking how to reduce the number of times you change a global variable?

That being said, a kernel should ideally touch each global variable/vector element only once: either when loading that data into the shared space, or when moving the result from the shared space back into global. (Assuming you have the shared space to hold the global data.)

If you are talking about addressing the same global variable over the span of multiple warps you should look into atomic operations so that you don’t run into any data race conditions.
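To make the atomic suggestion concrete, here is a minimal sketch, assuming the goal is a single global flag that ends up nonzero if any thread's case is true. The names `flag_any_nonzero` and `any_nonzero_atomic` are made up for illustration, error checking is omitted, and `n` is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread tests one element and, if its case is
// true, ORs 1 into a single global flag. atomicOr performs the
// read-modify-write as one indivisible operation.
__global__ void flag_any_nonzero(const int *cases, int n, int *flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && cases[i] != 0) {
        atomicOr(flag, 1);
    }
}

// Hypothetical host helper: returns 1 if any element of h_cases is nonzero,
// 0 otherwise. Assumes n > 0; error checking omitted for brevity.
int any_nonzero_atomic(const int *h_cases, int n)
{
    int *d_cases = nullptr;
    int *d_flag = nullptr;
    int h_flag = 0;

    cudaMalloc(&d_cases, n * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_cases, h_cases, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_flag, 0, sizeof(int));  // flag starts out false

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    flag_any_nonzero<<<blocks, threads>>>(d_cases, n, d_flag);

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_cases);
    cudaFree(d_flag);
    return h_flag;
}
```

Only threads whose case is true issue an atomic, so the number of global updates scales with the number of true cases rather than with `n`.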

Good luck,


*As a note, I mainly code for OpenCL but have a general knowledge of CUDA and the similarities between the two. If I said something incorrectly regarding CUDA please let me know and I’ll alter my post for that.


Appreciate your asking for clarification: I’m trying to reduce the number of times a global variable is changed.

Here is some context: a method must check many different cases across n threads. If any one case is true, the overall result is true regardless of how many cases are false.

I understand that one solution could be to simply add either 1 or 0 to the global variable without flow control: the result ends up nonzero (true) if any thread added 1, and adding 0 (false) never changes that. The addition would happen n times, though, and it seems like a waste to perform needless writes once true has been found before the end of the n checks. Thus the question.

I have similar versions of this problem logic in other methods doing different work.

Does that make sense?

Okay I think that makes more sense. I think in this case, race conditions might not really matter to you.

Overall, I think your approach should be that whenever a thread finds a true case, it sets the global variable to true. Even with race conditions it will remain true, because every thread that writes stores the same value.

That being said, conditionals (ifs/else ifs/elses) inside of GPU kernels aren’t free, especially when threads in the same warp take different branches (divergence): the warp then executes each path in turn, which can cause slowdowns. A branch that every thread in a warp takes the same way costs much less. I have also been taught to write conditionals so that the condition is true more often than not, though I’m not sure how much that matters for CUDA’s compiler.

Therefore, you could set up an additional if statement checking whether the global has been altered yet, but be aware that this causes added branching.
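The two options discussed here can be sketched side by side. This is only an illustration under stated assumptions: the kernel names and the host helper `any_nonzero_plain` are hypothetical, error checking is omitted, and `n` is assumed to be greater than zero. Plain (non-atomic) stores to the same address from many threads are formally a race; the sketch relies on every writer storing the same value, 1.

```cuda
#include <cuda_runtime.h>

// Variant A: every thread whose case is true stores 1. All writers store
// the same value, so the final value is 1 no matter which store lands last.
__global__ void set_flag_plain(const int *cases, int n, int *flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && cases[i] != 0) {
        *flag = 1;
    }
}

// Variant B: additionally read the flag and skip the store if it is already
// set. This trades a global write for a global read plus one more condition.
// volatile asks the compiler to really perform the read.
__global__ void set_flag_checked(const int *cases, int n, volatile int *flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && cases[i] != 0 && *flag == 0) {
        *flag = 1;
    }
}

// Hypothetical host helper: runs one of the two variants and returns the
// flag (1 if any element is nonzero, else 0). Assumes n > 0.
int any_nonzero_plain(const int *h_cases, int n, bool check_first)
{
    int *d_cases = nullptr;
    int *d_flag = nullptr;
    int h_flag = 0;

    cudaMalloc(&d_cases, n * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_cases, h_cases, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_flag, 0, sizeof(int));  // flag starts out false

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (check_first) {
        set_flag_checked<<<blocks, threads>>>(d_cases, n, d_flag);
    } else {
        set_flag_plain<<<blocks, threads>>>(d_cases, n, d_flag);
    }

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_cases);
    cudaFree(d_flag);
    return h_flag;
}
```

Whether Variant B is actually faster than Variant A depends on how often the case is true and on the hardware, so it is worth timing both rather than assuming the extra check pays off.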

Another option is to do something similar to a reduction sum. Each thread evaluates its own case as 1 or 0, and a reduction then adds those values together. Afterwards you apply the sum to the global variable: a result of 0 means no case was true, and any positive result means at least one was. This might end up being more computationally involved, though.
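A minimal sketch of that reduction idea, assuming a block-level sum in shared memory followed by one `atomicAdd` per block (the atomic is my addition for combining blocks, not something stated above). The names `count_true`, `count_nonzero`, and `BLOCK_SIZE` are hypothetical, error checking is omitted, and `n` is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // must be a power of two for the halving loop below

// Each thread contributes 1 if its case is true, else 0. The block sums
// those values in shared memory, and thread 0 adds the block's total to a
// global counter, so global memory is updated at most once per block.
__global__ void count_true(const int *cases, int n, int *total)
{
    __shared__ int partial[BLOCK_SIZE];
    int tx = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tx;

    partial[tx] = (i < n && cases[i] != 0) ? 1 : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tx < stride) {
            partial[tx] += partial[tx + stride];
        }
        __syncthreads();
    }

    // Skip the global update entirely when the block found nothing.
    if (tx == 0 && partial[0] != 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: returns how many elements are nonzero.
// A result of 0 means "false"; anything positive means "true".
int count_nonzero(const int *h_cases, int n)
{
    int *d_cases = nullptr;
    int *d_total = nullptr;
    int h_total = 0;

    cudaMalloc(&d_cases, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_cases, h_cases, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    count_true<<<blocks, BLOCK_SIZE>>>(d_cases, n, d_total);

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_cases);
    cudaFree(d_total);
    return h_total;
}
```

If only a true/false answer is needed rather than a count, CUDA's `__syncthreads_or` block-wide vote could replace the explicit loop, at the cost of departing from the reduction-sum formulation.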