Selective output versus redundant output: which is the better theoretical approach?

I was wondering which approach is better in general (no specific problem in mind) when an output value is not necessarily produced for every input value.

e.g.

if (input[i] < 10)
    output[42] += 1;

Again, I reiterate that this is not a practical piece of code; it is only meant to demonstrate what I mean by selective output. In this example the idea is 'increment a counter in global memory (at location 42) for every input element which is less than 10'.

Obviously, with a condition like this in a kernel, memory accesses are very inefficient (ignoring the fact that it’s only incrementing a value).

Redundant output in this case would, for example, write an array of true/false flags (less than 10 or not), one per input element, and then leave it to the host to filter these.

Which of these approaches typically works best? Bear in mind that if you wanted to record more than one piece of information per element, the output could need 4+ times the space of the input (which feels very wasteful, though obviously the bigger picture is the more important one).

Any thoughts on the subject?

Your example underscores how you can’t ask a generic question like that.

In your scenario, every thread that passes the test updates the same location, so you’d have to be issuing atomic ops on the GPU, which are hell-slow. It’d make perfect sense to simply write to an array and then go over it in a second pass (on the CPU or even on the GPU).
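A minimal CPU-side sketch of that two-pass idea, in plain C++ rather than CUDA so it can be run anywhere. The names `mark_below` and `count_flags` are made up for illustration; the loop in `mark_below` stands in for the kernel, with each index `i` playing the role of one thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1 (stand-in for the kernel): element i writes only to flags[i].
// No two "threads" ever touch the same location, so no atomic op is needed.
std::vector<std::uint8_t> mark_below(const std::vector<int>& input, int threshold) {
    std::vector<std::uint8_t> flags(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        flags[i] = input[i] < threshold ? 1 : 0;
    }
    return flags;
}

// Pass 2: go over the flag array and reduce it to a single count.
// On a GPU this would typically be a parallel reduction; here it is a plain loop.
std::size_t count_flags(const std::vector<std::uint8_t>& flags) {
    std::size_t count = 0;
    for (std::uint8_t f : flags) {
        count += f;
    }
    return count;
}
```

With one byte per flag, the intermediate array here is smaller than an array of 4-byte inputs, so the space cost depends heavily on what you record per element.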

In general though, you should think about the idea of divergent threads. When threads in the same warp take different branches, the hardware runs each branch in turn while the other threads sit idle, so they all end up paying for the work of every branch. In the “general” case (if there ever was one), your two approaches are the same!
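To illustrate why the two forms end up equivalent, here is a rough CPU-side sketch in plain C++. The function names and the per-element operation (`in[i] * 2`) are hypothetical, chosen only for the example. The second function mimics what a diverged warp effectively does: every lane steps through the work, and a predicate decides whether the result is kept.

```cpp
#include <cstddef>
#include <vector>

// Selective form: only elements that pass the test produce an output.
void selective(const std::vector<int>& in, std::vector<int>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 10) {
            out[i] = in[i] * 2;
        }
    }
}

// Predicated form: every element does the work and writes a value;
// the predicate only selects between the new result and the old contents.
void predicated(const std::vector<int>& in, std::vector<int>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool take = in[i] < 10;
        const int result = in[i] * 2;   // computed regardless of the test
        out[i] = take ? result : out[i];
    }
}
```

Both functions leave `out` in the same state; the difference is only in whether the test is expressed as a branch or as a select.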