I have a GPU kernel that uses the atomicAdd_system() call. The kernel was working on multi-GPU systems (with V100s) but is failing on a system with four RTX 4070 GPUs.
I am compiling for compute capabilities sm_75 and sm_86 with CUDA toolkit 11.8.
Thank you for the response. I tried with sm_89 and am getting the same result. It’s weird, my code works fine with multiple V100, multiple 3070s, multiple 3080s.
The new 4070 Ti cards seem to be problematic and don’t execute the atomicAdd_system() function properly.
This is code that I have written, and I was able to examine the output of atomicAdd_system() in gdb. The same code works on other multi-GPU systems; I just verified it again on another machine with two 3070s, and it runs properly on that system.
Failing how, exactly? A spontaneous system reboot? An error message printed to a system log? If so, what is the error message?
When asking for assistance in resolving some issue, it is highly advisable to post a minimal self-contained program that reproduces the issue and that others can build and run. It is also advisable to mention system configuration (e.g. what OS is this), compiler switches, etc. The first step in problem resolution is independent reproduction of the issue. Fun fact: In 50% of cases trying to create a minimal reproducer actually reveals the root cause.
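For example, a minimal reproducer for a system-scoped atomic on managed memory touched by two GPUs might look something like the sketch below. This is a hypothetical illustration, not the original poster's code; the names, launch configuration, and the omission of error checking are all my own choices for brevity.

#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 to a counter that lives in managed memory and is
// updated by kernels running on two different devices.
__global__ void bump(unsigned long long *counter)
{
    atomicAdd_system(counter, 1ULL);
}

int main()
{
    unsigned long long *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(*counter));
    *counter = 0;

    // Launch the same kernel on device 0 and device 1.
    cudaSetDevice(0);
    bump<<<64, 256>>>(counter);
    cudaSetDevice(1);
    bump<<<64, 256>>>(counter);

    // Synchronize both devices before reading the result on the host.
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaSetDevice(1);
    cudaDeviceSynchronize();

    // Expected: 2 * 64 * 256 = 32768 if system-scoped atomics work
    // across both GPUs on this platform.
    printf("counter = %llu (expected %d)\n", *counter, 2 * 64 * 256);

    cudaFree(counter);
    return 0;
}

Something this size can be posted in full, built with the exact nvcc command line, and run on both the working and the failing machines.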
Sometimes a piece of software happens to work for quite some time, but fails after a hardware or software change, because it did not actually work by design, i.e. it actually contains a dormant bug that eventually manifests itself.
When debugging issues it is important to run controlled experiments where only one variable changes at any one time. E.g. take the system where the app works with two RTX 3070s, unplug those cards, then plug the RTX 4070s into the same PCIe slots, with no other changes. Does the issue (whatever it is) manifest?
It is failing in the sense that the GPU where atomicAdd_system() is running reads 0 values when accessing managed memory (which is likely resident on another GPU). There is no crash or reboot; I just get wrong results for the addition.
I will write a test case and post it, but that will take time; the code I am working with is huge. It has been working for a while on 4-GPU machines, 8-GPU machines, etc. As I said earlier, I have run this on different GPUs and it works everywhere except on the 4070 Ti devices.
Yes, I have done exactly as you suggested with respect to swapping out the cards. For example, I ran my software on a system with 2 or 4 4070 Ti cards in it and I see the failure. I also have 2 3070 cards: I power down the machine, replace the GPUs with the 3070s, run the same software, and it works. The cards are placed in the same slots, etc. Incidentally, the failure also occurs with only 2 4070s.
In terms of the machine specification, I am running Linux 20.04 on an ASUS motherboard. There is no SLI or NVLink. The toolkit version is 11.8 and I build the software for sm_75, sm_86 and sm_89.
I tried the application with two different driver versions, 525.89.02 and 525.78.01, and get the same behavior as described.
Does your use case fulfill the requirements for atomicAdd_system? Memory model | libcu++ (see the attribute-check sketch after the quoted conditions below)
Atomicity

An atomic operation is atomic at the scope it specifies if:
- it specifies a scope other than thread_scope_system, or
- the scope is thread_scope_system and:
  - it affects an object in unified memory and concurrentManagedAccess is 1, or
  - it affects an object in CPU memory and hostNativeAtomicSupported is 1, or
  - it is a load or store that affects a naturally-aligned object of sizes 1, 2, 4, or 8 bytes on mapped memory, or
  - it affects an object in GPU memory and only GPU threads access it.

Refer to the CUDA programming guide for more information on unified memory, mapped memory, CPU memory, and GPU peer memory.
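A quick way to check the relevant properties on the failing box is sketched below. The attribute enums (cudaDevAttrConcurrentManagedAccess, cudaDevAttrHostNativeAtomicSupported) and the peer-access query are standard CUDA runtime APIs; the surrounding scaffolding is just illustrative.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        int concurrentManaged = 0, hostNativeAtomics = 0;
        cudaDeviceGetAttribute(&concurrentManaged,
                               cudaDevAttrConcurrentManagedAccess, dev);
        cudaDeviceGetAttribute(&hostNativeAtomics,
                               cudaDevAttrHostNativeAtomicSupported, dev);
        printf("device %d: concurrentManagedAccess=%d hostNativeAtomicSupported=%d\n",
               dev, concurrentManaged, hostNativeAtomics);

        // Peer access is also worth checking, since the managed
        // allocation may be resident on another GPU.
        for (int peer = 0; peer < count; ++peer) {
            if (peer == dev) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            printf("  device %d can access peer %d: %d\n", dev, peer, canAccess);
        }
    }
    return 0;
}

Comparing this output between the 3070 system and the 4070 Ti system would show whether one of the conditions quoted above is no longer met.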
You say that you are reading 0 values. Do you mean the return value of atomicAdd_system? How do you know that other GPUs have written to this memory before?
I am not deeply familiar with the C++ memory model, but according to the docs atomicAdd_system has memory_order_relaxed semantics. Maybe you need a memory fence to be able to observe values written by other devices.
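If ordering turns out to be the issue, one option is to pair the relaxed atomic with an explicit system-wide fence (__threadfence_system()), or to use libcu++'s cuda::atomic_ref with thread_scope_system so the memory order is stated explicitly. A rough sketch under those assumptions (the produce/consume kernels and the counter/flag names are made up for illustration):

#include <cstdio>
#include <cuda/atomic>

__global__ void produce(unsigned int *counter, unsigned int *flag)
{
    // Relaxed system-scope add, which is what atomicAdd_system() provides ...
    atomicAdd_system(counter, 1u);
    // ... followed by a system-wide fence so the update is visible on
    // other devices (and the host) before the flag is raised.
    __threadfence_system();
    atomicExch_system(flag, 1u);
}

__global__ void consume(const unsigned int *counter, unsigned int *flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // cuda::atomic_ref makes the scope and memory order explicit
        // instead of relying on relaxed semantics plus separate fences.
        cuda::atomic_ref<unsigned int, cuda::thread_scope_system> f(*flag);
        while (f.load(cuda::memory_order_acquire) == 0) { /* spin */ }
        printf("observed counter = %u\n", *counter);
    }
}

Whether a fence actually explains your symptom depends on how the result is read back; if the host only reads the counter after cudaDeviceSynchronize() on all devices, a missing fence would not by itself account for seeing zeros.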