System-wide atomic operation failure on multi-4090 systems

I would like to share my sample code here, which exposes erroneous behavior of multi-4090 systems.

https://github.com/mino-hidetoshi/System-wide_Atomic

As far as I have tested, it produces non-deterministic results when run on a multi-4090 system, as follows:

Compilation (system-scope atomics such as atomicAdd_system() require compute capability 6.0 or later, hence -arch=sm_60):

$ nvcc -arch=sm_60 atomic-01.cu

Execution examples:

# single 4090 execution
$ CUDA_VISIBLE_DEVICES=0 ./a.out 
atomic-01.cu 
5000000

# dual 4090 execution
$ CUDA_VISIBLE_DEVICES=0,1 ./a.out 
atomic-01.cu 
5000269

I would like to know if this is common to all multi-4090 environments.
I have tested several cloud GPU instances and got these non-deterministic results without exception. How about yours?

Multi-3090 systems (or older) always produce the correct result of 5000000.

---- added on 2023/02/19

In the sample code, the kernel repeatedly updates a managed variable with atomicAdd_system(). If the atomic operation works correctly, the result should be 5000000. If atomicity is broken, the result becomes unpredictable.
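For readers who do not want to open the repository, here is a minimal sketch of that pattern. It is not the actual atomic-01.cu; the kernel name, launch configuration, and iteration counts are illustrative assumptions, but the mechanism is the same idea: every visible GPU performs system-scope atomic increments on one counter allocated with cudaMallocManaged().

// Minimal sketch of the pattern described above -- not the actual
// atomic-01.cu from the repository; kernel name, launch configuration
// and iteration counts are illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_kernel(unsigned long long *counter, int iters)
{
    // Each thread performs 'iters' system-scope atomic increments.
    for (int i = 0; i < iters; ++i)
        atomicAdd_system(counter, 1ULL);
}

int main()
{
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);

    // One counter in managed (unified) memory, visible to all GPUs.
    unsigned long long *counter;
    cudaMallocManaged(&counter, sizeof(*counter));
    *counter = 0ULL;

    // 5,000,000 increments in total, split evenly across the visible GPUs
    // (the split is exact for 1 or 2 GPUs with these sizes).
    const int blocks = 100, threads = 100;
    const long long total = 5000000LL;
    const int iters = (int)(total / ((long long)n_dev * blocks * threads));

    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        add_kernel<<<blocks, threads>>>(counter, iters);
    }
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }

    printf("%llu\n", *counter);   // should print 5000000 if atomicity holds
    cudaFree(counter);
    return 0;
}

With one GPU, all 5,000,000 increments come from a single device; with two GPUs they are split in half, so the final value must still be exactly 5000000 if atomicAdd_system() really is atomic across devices.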

I have now found that this multi-4090 issue is Docker image dependent.

Three examples I tested follow:

nvidia/cuda:11.3.0-devel-ubuntu18.04 : No good (non-deterministic result)
nvidia/cuda:11.3.0-devel-ubuntu20.04 : Good (correct result of 5000000)
nvidia/cuda:12.0.0-devel-ubuntu22.04 : No good (non-deterministic result)
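If you want to check your own machine, a command along these lines should reproduce the test inside one of these images (assuming the NVIDIA Container Toolkit is installed; the image tag and mount path here are just examples):

$ docker run --rm --gpus all -v "$PWD":/work -w /work nvidia/cuda:12.0.0-devel-ubuntu22.04 \
      bash -c "nvcc -arch=sm_60 atomic-01.cu && CUDA_VISIBLE_DEVICES=0,1 ./a.out"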

The Ubuntu 22.04 image appears to be problematic, and older images also seem unreliable.

In summary, this does not look like a GPU issue but rather a Docker image issue.

Sorry if I bothered you.