I have a big float array and I have to update only certain elements.
At the moment I have a second byte array in which the value is set to 1 if there is something to do and 0 if not. The algorithm has to run through this mask array to determine which elements have to be updated. The percentage of updates is quite low, maybe 5 %, and this value changes during the algorithm. Using this mask saves me 36 memory reads on elements that don't need an update.
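A minimal sketch of the mask approach described above, assuming one thread per element. The names `maskedUpdate`, `update` and `runMaskedUpdate` and the update formula are placeholders for illustration only; the real update with its 36 reads is not shown, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the real (much more expensive) per-element update.
__device__ float update(float x) { return 2.0f * x + 1.0f; }

// One thread per element; threads whose mask byte is 0 do nothing.
__global__ void maskedUpdate(float* data, const unsigned char* mask, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && mask[i]) {
        data[i] = update(data[i]);
    }
}

// Host helper (hypothetical): copies the input to the device, runs the
// kernel once and returns the updated array. Assumes a non-empty input.
std::vector<float> runMaskedUpdate(std::vector<float> data,
                                   const std::vector<unsigned char>& mask)
{
    int n = static_cast<int>(data.size());
    float* dData = nullptr;
    unsigned char* dMask = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dMask, n * sizeof(unsigned char));
    cudaMemcpy(dData, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMask, mask.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    maskedUpdate<<<blocks, threads>>>(dData, dMask, n);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(data.data(), dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dMask);
    return data;
}
```

With only about 5 % of the mask set, most threads in a warp exit immediately, so much of the launched work is just reading the mask.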
Is there a better and faster way to do that in parallel? Working with device memory atomics is a bad idea, in my experience. Shared memory atomics might be an idea?
Are there any other ideas? A parallel list implementation?
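To make the shared memory atomics / list idea concrete, here is a rough sketch of one possible variant, not a tested or tuned solution. Each block counts its flagged elements with a shared memory atomic, then reserves space in a global index list with a single device memory atomic per block. The names `buildIndexList` and `buildList` are hypothetical, error checking is omitted, and the order of entries in the list is not deterministic.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Writes the indices of all flagged elements into list and their count
// into *listSize (which must be zero before the launch).
__global__ void buildIndexList(const unsigned char* mask, int n,
                               int* list, int* listSize)
{
    __shared__ int blockCount;   // flagged elements found in this block
    __shared__ int blockOffset;  // start of this block's slots in list

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool flagged = (i < n) && mask[i];
    int local = 0;
    if (flagged) local = atomicAdd(&blockCount, 1);  // shared memory atomic
    __syncthreads();

    // Only one device memory atomic per block.
    if (threadIdx.x == 0) blockOffset = atomicAdd(listSize, blockCount);
    __syncthreads();

    if (flagged) list[blockOffset + local] = i;
}

// Host helper (hypothetical): returns the flagged indices, sorted only so
// that the result is reproducible. Assumes a non-empty mask.
std::vector<int> buildList(const std::vector<unsigned char>& mask)
{
    int n = static_cast<int>(mask.size());
    unsigned char* dMask = nullptr;
    int* dList = nullptr;
    int* dSize = nullptr;
    cudaMalloc(&dMask, n * sizeof(unsigned char));
    cudaMalloc(&dList, n * sizeof(int));
    cudaMalloc(&dSize, sizeof(int));
    cudaMemcpy(dMask, mask.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    cudaMemset(dSize, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    buildIndexList<<<blocks, threads>>>(dMask, n, dList, dSize);

    int count = 0;
    cudaMemcpy(&count, dSize, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> list(count);
    cudaMemcpy(list.data(), dList, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dMask);
    cudaFree(dList);
    cudaFree(dSize);
    std::sort(list.begin(), list.end());
    return list;
}
```

A second kernel could then launch only `count` threads and update `data[list[k]]`, so every thread has real work. Whether this beats the plain mask depends on how often the mask changes, since the list has to be rebuilt each time.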