What kind of operations are you measuring (Gop/s)?
What GPU are you using?
Please provide a complete, working code example that others can use for benchmarking.
Some thoughts:
The second atomic kernel has a race condition on *hitsn: multiple threads could read the same value.
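To illustrate the race, here is a minimal sketch. The original kernel is not reproduced here, so this assumes the common pattern of incrementing a counter and then reading it back to choose an output slot. The names collect_hits, run_collect_hits, data, threshold and hits are hypothetical, and error checking is omitted for brevity. The idea is to use the value returned by atomicAdd instead of dereferencing the counter a second time.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: stores the index of every element greater than
// `threshold`. `hitsn` must be zero before launch and `hits` must have
// room for n entries.
__global__ void collect_hits(const float* data, int n, float threshold,
                             int* hits, int* hitsn)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold) {
        // Racy variant (assumed shape of the problem, do not use):
        //   atomicAdd(hitsn, 1);
        //   hits[*hitsn] = i;  // other threads may have changed *hitsn by now
        //
        // atomicAdd returns the value stored before the increment, so every
        // thread that gets here receives a distinct slot.
        int slot = atomicAdd(hitsn, 1);
        hits[slot] = i;
    }
}

// Hypothetical host wrapper, included only to make the sketch runnable.
std::vector<int> run_collect_hits(const std::vector<float>& h_data, float threshold)
{
    int n = static_cast<int>(h_data.size());
    if (n == 0) return {};

    float* d_data = nullptr;
    int* d_hits = nullptr;
    int* d_hitsn = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_hits, n * sizeof(int));
    cudaMalloc(&d_hitsn, sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hitsn, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    collect_hits<<<blocks, threads>>>(d_data, n, threshold, d_hits, d_hitsn);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    int count = 0;
    cudaMemcpy(&count, d_hitsn, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> h_hits(count);
    cudaMemcpy(h_hits.data(), d_hits, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_hits);
    cudaFree(d_hitsn);
    return h_hits;
}
```

Note that the order of the entries in hits depends on thread scheduling, so sort the result if you need a deterministic order.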
To me, this looks like it could be implemented with Thrust, using a combination of a transform iterator and copy_if.
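A rough sketch of what the Thrust version might look like. Since the actual per-element computation is not shown, the Square functor and the AboveThreshold predicate are placeholders, and find_hits is a hypothetical name. The transform iterator computes the values on the fly, and the stencil overload of thrust::copy_if writes out the indices whose computed value passes the predicate.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the real per-element computation (assumption).
struct Square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Placeholder hit test (assumption): a value is a hit if it exceeds t.
struct AboveThreshold {
    float t;
    __host__ __device__ bool operator()(float v) const { return v > t; }
};

// Returns the indices i for which Square(data[i]) > threshold.
// copy_if is stable, so the indices come out in ascending order.
thrust::device_vector<int> find_hits(const thrust::device_vector<float>& data,
                                     float threshold)
{
    int n = static_cast<int>(data.size());

    // Lazily transformed view of the input; no temporary array is allocated.
    auto values = thrust::make_transform_iterator(data.begin(), Square{});

    // The indices 0..n-1 are the elements that get copied; `values` acts
    // as the stencil that decides which of them to keep.
    thrust::counting_iterator<int> first(0);

    thrust::device_vector<int> hits(n);
    auto end = thrust::copy_if(first, first + n, values, hits.begin(),
                               AboveThreshold{threshold});
    hits.resize(end - hits.begin());
    return hits;
}
```

This avoids the hand-written atomic counter entirely, and the number of hits is simply hits.size() afterwards.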