I have big float arrays and I have to update only certain elements.
At the moment I have a second byte array where the value is set to 1 if there is something to do and 0 if not. The algorithm has to run through this mask array to determine which elements have to be updated. The percentage of updates is quite low, maybe 5 percent, and this value changes during the algorithm. Using this mask saves me 36 memory reads on each element that doesn’t need an update.
Is there a better and faster way to do that in parallel? Working with device memory atomics is a bad idea, in my experience. Shared memory atomics might be an idea?
Are there any other ideas? A parallel list implementation?
I am not sure I follow. You are using what is effectively a permutation matrix in your kernel to determine which matrix elements should be “updated” (whatever that is). But surely the permutation is known a priori, or is it being generated dynamically as well?
You can see it as a permutation matrix, and it is generated dynamically.
I’m searching for an approximately shortest path in a 3D array with the Bellman-Ford algorithm, and I know in each iteration which elements need an update (reading all neighbors and writing the minimum to the element). This is part of an image segmentation algorithm.
When I update an element, I trigger an update of its neighbors in the next iteration via the byte mask (setting those elements to 1 in the permutation matrix).
The distances propagate in circular waves if you have a 2D image and set the starting point in the middle.
There are some cases where two threads want to update a cell of an array and it doesn’t matter which thread’s value is the final value.
This would particularly be the case where they are both updating it with the same value.
e.g. threads 4 and 6 of a block both want to update cell[r][c] to 6.023
or simply it really doesn’t matter if it ends up as 63 or 117 just so long as it is one of them.
The question is: is it OK to allow code to do this on the GPU, or is it dangerous and might it lead to mangled values?
It would be OK, but your algorithm won’t be deterministic. The result might differ for the same input data on every run of your algorithm. This is quite horrible to debug, but otherwise it should be OK.