CUDA atomic add: is there a limit on the number of operations?

I have 60 * 1024 * 60 * 1024 threads writing data to a block of memory of size 1500 * 1500. Each thread performs one addition to a position in that 1500 * 1500 memory, so the final result is the accumulation of 60 * 1024 * 60 * 1024 additions. I therefore need to use the atomic add operation here, to ensure that the result of every thread's addition is counted.
At this size the program always fails with an exception when I run it. When I reduce the scale of the problem, it executes without errors.
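For reference, here is a minimal sketch of the pattern described above. It is not the original program, which was not posted. The names `accumulate`, `run_accumulate`, `CUDA_CHECK`, `kSide` and `kCells` are hypothetical. The mapping from thread id to memory position (`id % kCells`) and the launch configuration are assumptions made only so the sketch is complete. One fact worth noting: 60 * 1024 * 60 * 1024 = 3,774,873,600, which is larger than the maximum 32-bit signed integer (2,147,483,647), so the sketch uses a 64-bit index and a grid-stride loop instead of one hardware thread per addition. Whether that is related to the exception above is not known. The error checks after the launch are there to show which error, if any, the runtime reports.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

constexpr int kSide  = 1500;            // memory is kSide x kSide, as in the post
constexpr int kCells = kSide * kSide;   // 2,250,000 positions

// Each logical thread id adds 1 to one position.
// ASSUMPTION: the target position is id % kCells; the post does not say.
__global__ void accumulate(unsigned int *cells, unsigned long long total)
{
    // 64-bit arithmetic so that ids above 2^31 - 1 do not overflow.
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long id =
             (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         id < total; id += stride) {
        atomicAdd(&cells[id % kCells], 1u);
    }
}

// Runs `total` atomic additions and returns the sum over all positions.
unsigned long long run_accumulate(unsigned long long total)
{
    unsigned int *d_cells = nullptr;
    CUDA_CHECK(cudaMalloc(&d_cells, kCells * sizeof(unsigned int)));
    CUDA_CHECK(cudaMemset(d_cells, 0, kCells * sizeof(unsigned int)));

    // ASSUMPTION: 1024 blocks of 256 threads; the grid-stride loop covers the rest.
    accumulate<<<1024, 256>>>(d_cells, total);
    CUDA_CHECK(cudaGetLastError());        // reports an invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    std::vector<unsigned int> h_cells(kCells);
    CUDA_CHECK(cudaMemcpy(h_cells.data(), d_cells,
                          kCells * sizeof(unsigned int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_cells));

    unsigned long long sum = 0;
    for (unsigned int v : h_cells) sum += v;
    return sum;
}

int main()
{
    // Small size for a quick check; the size in the post would be
    // 60ULL * 1024 * 60 * 1024.
    const unsigned long long total = 1ULL << 24;
    const unsigned long long sum = run_accumulate(total);
    std::printf("expected %llu, got %llu\n", total, sum);
    return sum == total ? 0 : 1;
}
```

If every addition is counted, the sum over all positions equals the number of additions requested, which gives a simple check that can be run at increasing sizes to find where the failure starts.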

You posted in the wrong board if you’re seeking programming-related answers. It would help if you posted a code example that shows the issue you’re experiencing, along with any error messages you get. In other words, be precise and detailed.