I would like to share some sample code that exposes erroneous behavior on multi-4090 systems.
https://github.com/mino-hidetoshi/System-wide_Atomic
In my tests, it produces non-deterministic results when run on a multi-4090 system, as shown below:
Compilation:
$ nvcc -arch=sm_60 atomic-01.cu
Execution examples:
# single 4090 execution
$ CUDA_VISIBLE_DEVICES=0 ./a.out
atomic-01.cu
5000000
# dual 4090 execution
$ CUDA_VISIBLE_DEVICES=0,1 ./a.out
atomic-01.cu
5000269
I would like to know whether this is common to all multi-4090 environments.
I have tested several cloud GPU instances and got non-deterministic results on every one of them, WITH NO EXCEPTION. How about yours?
Multi-3090 systems (or older) always produce the correct result of 5000000.
---- added on 2023/02/19
In the sample code, a kernel repeatedly updates managed variables with atomicAdd_system(). If the atomic operations work correctly, the result should be 5000000. If atomicity is broken, the result becomes unpredictable.
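
For readers who do not want to open the repository, here is a minimal sketch of the kind of test described above. It is not the actual atomic-01.cu: the kernel name, the launch sizes, and the CHECK macro are my own placeholders, and the expected total here scales with the number of visible GPUs instead of being fixed at 5000000. It assumes a platform where GPUs may access managed memory concurrently, and it reports when a device does not advertise that.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical launch sizes for illustration; not taken from atomic-01.cu.
constexpr int kBlocks  = 10;
constexpr int kThreads = 100;
constexpr int kIters   = 100;

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Every thread increments the shared counter kIters times using the
// system-scope atomic (requires compute capability 6.0 or newer).
__global__ void add_kernel(int *counter, int iters) {
    for (int i = 0; i < iters; ++i) {
        atomicAdd_system(counter, 1);
    }
}

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));

    // One managed counter shared by the host and all visible GPUs.
    int *counter = nullptr;
    CHECK(cudaMallocManaged(&counter, sizeof(int)));
    *counter = 0;

    // Launch the same kernel on every visible device without waiting,
    // so the devices update the counter concurrently.
    for (int d = 0; d < ndev; ++d) {
        CHECK(cudaSetDevice(d));
        int concurrent = 0;
        CHECK(cudaDeviceGetAttribute(&concurrent,
                                     cudaDevAttrConcurrentManagedAccess, d));
        if (!concurrent) {
            fprintf(stderr, "device %d: no concurrent managed access\n", d);
        }
        add_kernel<<<kBlocks, kThreads>>>(counter, kIters);
        CHECK(cudaGetLastError());
    }

    // Wait for all devices before reading the counter on the host.
    for (int d = 0; d < ndev; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaDeviceSynchronize());
    }

    long long expected = 1LL * ndev * kBlocks * kThreads * kIters;
    printf("result = %d, expected = %lld\n", *counter, expected);

    CHECK(cudaFree(counter));
    return 0;
}
```

It can be compiled the same way as the sample (nvcc -arch=sm_60) and run with different CUDA_VISIBLE_DEVICES settings. If the system-scope atomics behave correctly, the two printed numbers should match for any number of GPUs.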