I'm doing a Hough transform on an image: for each edge pixel, I increment
the buffer (dst) along the corresponding sin/cos curve. Because many edge pixels can
map to the same Hough-space (buffer) coordinate, I need an atomic increment, or the
count gets overwritten by another thread.
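For readers following along, here is a minimal sketch of the kind of voting kernel described above. It is not the original poster's code: the names hough_vote and hough_total_votes, the 180-step theta resolution, and the rho mapping are all assumptions made for illustration. It uses one thread per edge pixel and atomicAdd on the accumulator, and it assumes a device that supports 32-bit global atomics.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int N_THETA = 180;  // hypothetical: one accumulator column per degree

// One thread per edge pixel; each thread casts one vote per theta step.
__global__ void hough_vote(const int2* edges, int n_edges,
                           unsigned int* dst, int n_rho, float rho_max)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_edges) return;
    int2 p = edges[i];
    for (int t = 0; t < N_THETA; ++t) {
        float theta = t * (3.14159265f / N_THETA);
        float rho = p.x * cosf(theta) + p.y * sinf(theta);
        // Map rho in [-rho_max, rho_max] to a row index in [0, n_rho - 1].
        int r = (int)roundf((rho + rho_max) * (n_rho - 1) / (2.0f * rho_max));
        if (r < 0 || r >= n_rho) continue;  // guard against out-of-bounds writes
        // Several threads may hit the same cell, so the increment must be atomic.
        atomicAdd(&dst[r * N_THETA + t], 1u);
    }
}

// Hypothetical host helper: runs the kernel and returns the total number of votes.
unsigned long long hough_total_votes(const std::vector<int2>& edges,
                                     int width, int height)
{
    int n_edges = (int)edges.size();
    if (n_edges == 0) return 0;
    float rho_max = sqrtf((float)(width * width + height * height));
    int n_rho = 2 * (int)ceilf(rho_max) + 1;
    size_t n_cells = (size_t)n_rho * N_THETA;

    int2* d_edges = nullptr;
    unsigned int* d_dst = nullptr;
    cudaMalloc(&d_edges, n_edges * sizeof(int2));
    cudaMalloc(&d_dst, n_cells * sizeof(unsigned int));
    cudaMemcpy(d_edges, edges.data(), n_edges * sizeof(int2), cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, n_cells * sizeof(unsigned int));

    int block = 256;
    int grid = (n_edges + block - 1) / block;
    hough_vote<<<grid, block>>>(d_edges, n_edges, d_dst, n_rho, rho_max);
    cudaError_t err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);
    (void)err;

    std::vector<unsigned int> host(n_cells);
    cudaMemcpy(host.data(), d_dst, n_cells * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_edges);
    cudaFree(d_dst);

    unsigned long long total = 0;
    for (unsigned int v : host) total += v;
    return total;
}
```

With this mapping every vote lands inside the buffer, so the total should equal the number of edge pixels times N_THETA. If the same kernel is run with a plain `dst[...] += 1` instead of atomicAdd, the total will typically come out lower whenever pixels collide on a cell.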
So how did it work before without the atomic operations? Can you post the kernel? Anyway, one thing that can cause the problem is doing global writes that depend on these atomic operations. As far as I know, global writes are not actually performed at the moment your code executes them; they are queued up and flushed once in a while. Hopefully Nvidia will soon expose a global memory flush command.
In most cases such problems are caused by a faulty kernel (out-of-bounds memory access, compilation problems, etc.).
In my experience, the argument "but it works with non-atomic operations" is rather weak, because a faulty kernel can lead to completely unpredictable behavior.
I would recommend the following:
Try to run on a different device (if the problem remains, then your device is fine).
Try to strip as much code as you can out of the kernel (everything except the faulty part) to localize the problem.
If you end up with a very simple kernel with about 100M atomic calls that fails reproducibly, you should post that kernel here. I think the guys from Nvidia would be interested in that case.
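As a rough idea of what such a stripped-down reproducer could look like, here is a sketch that does nothing but atomic increments and then checks the counts on the host. The names atomic_stress and atomic_stress_ok, the 256 bins, and the launch sizes are assumptions for illustration, not anything from the original kernel.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // hypothetical: small bin count to force contention

// Each thread performs `iters` atomic increments, spread over the bins.
__global__ void atomic_stress(unsigned int* bins, int n_threads, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;
    for (int k = 0; k < iters; ++k)
        atomicAdd(&bins[(tid + k) % NUM_BINS], 1u);
}

// Returns true if every bin holds exactly the expected count.
// Assumes n_threads is a positive multiple of NUM_BINS, so the load is uniform.
bool atomic_stress_ok(int n_threads, int iters)
{
    assert(n_threads > 0 && n_threads % NUM_BINS == 0);
    unsigned int* d_bins = nullptr;
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    int block = 256;
    int grid = (n_threads + block - 1) / block;
    atomic_stress<<<grid, block>>>(d_bins, n_threads, iters);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<unsigned int> host(NUM_BINS);
    cudaMemcpy(host.data(), d_bins, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_bins);

    // For each k, the thread indices cover every bin equally often.
    unsigned int expected = (unsigned int)(n_threads / NUM_BINS) * (unsigned int)iters;
    for (unsigned int v : host)
        if (v != expected) ok = false;
    return ok;
}
```

Calling `atomic_stress_ok(1 << 20, 100)` issues roughly 100M atomic calls. If a test this small fails reproducibly, the kernel logic is unlikely to be the cause; if it passes, the problem is more likely in the original kernel (for example in its indexing).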