Uint64_t result evaluation & storage eats up 25% of kernel performance

What kind of operations are you measuring (Gop/s)?
Which GPU are you using?
Please provide a complete, working example that others can use to benchmark.

Some thoughts:

The second atomic kernel has a race condition on *hitsn: multiple threads can read the same value.
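
Since the kernel itself is not shown here, the following is only a sketch of the usual fix, assuming the counter is incremented atomically and then read again through *hitsn to pick the output slot. The names is_hit, hits, collect_hits_kernel and collect_hits are made up for illustration; only hitsn comes from the question. The point is to use the value returned by atomicAdd instead of dereferencing the counter a second time:

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical hit test standing in for the real result evaluation.
__host__ __device__ inline bool is_hit(uint64_t v) { return (v % 3) == 0; }

__global__ void collect_hits_kernel(const uint64_t* in, size_t n,
                                    uint64_t* hits, unsigned int* hitsn)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n && is_hit(in[i])) {
        // atomicAdd returns the value *before* the increment, so every
        // thread gets its own slot. Reading *hitsn separately would race.
        unsigned int slot = atomicAdd(hitsn, 1u);
        hits[slot] = in[i];
    }
}

// Host helper (assumed, error checking omitted for brevity).
// Returns the collected hits in unspecified order.
std::vector<uint64_t> collect_hits(const std::vector<uint64_t>& input)
{
    const size_t n = input.size();
    if (n == 0) return {};

    uint64_t* d_in = nullptr;
    uint64_t* d_hits = nullptr;
    unsigned int* d_hitsn = nullptr;
    cudaMalloc(&d_in, n * sizeof(uint64_t));
    cudaMalloc(&d_hits, n * sizeof(uint64_t));   // worst case: every input is a hit
    cudaMalloc(&d_hitsn, sizeof(unsigned int));
    cudaMemcpy(d_in, input.data(), n * sizeof(uint64_t), cudaMemcpyHostToDevice);
    cudaMemset(d_hitsn, 0, sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    collect_hits_kernel<<<blocks, threads>>>(d_in, n, d_hits, d_hitsn);

    unsigned int count = 0;
    cudaMemcpy(&count, d_hitsn, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    std::vector<uint64_t> out(count);
    if (count > 0)
        cudaMemcpy(out.data(), d_hits, count * sizeof(uint64_t), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_hits);
    cudaFree(d_hitsn);
    return out;
}
```

Note that the order of the stored hits depends on thread scheduling, so it can differ between runs.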

To me, this looks like it could be implemented with Thrust, using a transform iterator combined with copy_if.
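
A rough sketch of what I mean, not a drop-in replacement: Evaluate and IsHit are placeholder functors I invented (the real evaluation and hit test are whatever your kernel does), and find_hits is a hypothetical wrapper. The transform iterator computes each uint64_t result on the fly from an index, and copy_if stores only the results that pass the test, so there is no hand-written atomic counter:

```cuda
#include <cstdint>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the real per-element evaluation (assumption).
struct Evaluate {
    __host__ __device__ uint64_t operator()(uint64_t i) const { return i * i; }
};

// Placeholder for the real hit test (assumption).
struct IsHit {
    __host__ __device__ bool operator()(uint64_t v) const { return (v % 7) == 0; }
};

// Evaluates indices 0..n-1 on the device and returns only the hits.
std::vector<uint64_t> find_hits(uint64_t n)
{
    // Lazily maps index i to Evaluate{}(i); nothing is stored for non-hits.
    auto first = thrust::make_transform_iterator(
        thrust::counting_iterator<uint64_t>(0), Evaluate{});

    thrust::device_vector<uint64_t> hits(n);   // worst case: everything is a hit
    auto last_hit = thrust::copy_if(thrust::device, first, first + n,
                                    hits.begin(), IsHit{});
    hits.resize(last_hit - hits.begin());

    // Copy the compacted results back to the host.
    std::vector<uint64_t> out(hits.size());
    thrust::copy(hits.begin(), hits.end(), out.begin());
    return out;
}
```

copy_if is stable, so unlike the atomic version the hits come out in input order. Whether it is faster than your kernel is something you would have to benchmark.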