Low or normal performance?

Shouldn’t this be (*iters)++?
With that change, I get about 220.000 million kernel calls/second.
But that is still not what I expect … see below.
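
To show what I mean about the parentheses, here is a minimal sketch. It assumes the original line was `*iters++` (I am guessing from context), and it is plain host C++ rather than a kernel, since the operator precedence is the same in CUDA C++. The names `bump_wrong`, `bump_right`, and `count_after` are made up for this illustration.

```cpp
// Assumed original form: *iters++ parses as *(iters++).
// It advances the local copy of the pointer and reads the old value,
// but never modifies the counter itself.
void bump_wrong(unsigned long long *iters) {
    (void)*iters++;  // cast to void only to silence the unused-value warning
}

// Parenthesized form: dereference first, then increment the pointed-to value.
void bump_right(unsigned long long *iters) {
    (*iters)++;
}

// Hypothetical helper: apply one of the functions above `calls` times
// to a fresh counter and return the final counter value.
unsigned long long count_after(void (*bump)(unsigned long long *), int calls) {
    unsigned long long counter = 0;
    for (int i = 0; i < calls; ++i) {
        bump(&counter);
    }
    return counter;
}
```

Calling `count_after(bump_wrong, 1000)` leaves the counter at 0, while `count_after(bump_right, 1000)` gives 1000. This is a single-threaded illustration only; inside a kernel where many threads update the same counter, a plain `(*iters)++` is not atomic.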

OK, “kernel calls/second” might not be the usual terminology (sorry, I’m new to CUDA :), but how should I phrase it correctly?

Have a look at this thread (an example I found):
https://forums.developer.nvidia.com/t/bitslice-des-optimization/38896/48

How is it possible that this software does 23750 MH/s while my simple atomicAdd() code comes up with 1900 (whatever I should call it)? The thread above is about DES, which is far more complicated than one line of atomicAdd() or (*iters)++.

I must be missing something!