As I said, sudden reboots under high and rapidly changing load strongly point to a power supply issue. If the system runs just fine when using a single GPU (e.g. by hiding the other one via CUDA_VISIBLE_DEVICES), that would further confirm this working hypothesis. My understanding is that GPU power limits are enforced on a time scale of seconds, which is not sufficient to prevent short-duration spikes (see more below). Note that the issue could be due to CPU power draw as well.
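One way to run the single-GPU test is to launch the workload from a small wrapper that sets CUDA_VISIBLE_DEVICES for the child process only. This is a minimal sketch, not the only way to do it; the helper names single_gpu_env and run_on_single_gpu are made up for this illustration.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so it
    must be set before the workload process starts using the GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_single_gpu(command, gpu_index):
    """Run `command` (a list of arguments) with only one GPU visible.

    Returns the exit code of the child process.
    """
    completed = subprocess.run(command, env=single_gpu_env(gpu_index))
    return completed.returncode
```

For example, run_on_single_gpu([sys.executable, "train.py"], 0) would run a (hypothetical) train.py script on GPU 0 only. Repeating the run with index 1 exercises the other card, which helps separate a marginal power supply from a single faulty GPU.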
I asked for the complete hardware configuration (CPUs, system memory, mass storage) to get an idea of the total nominal power consumption. By my guess at a typical system, a 1500W PSU (power supply unit) should be sufficient for a system with dual RTX 3090s, but that may not actually be the case depending on specifics. It is also possible that there are imbalances in the power supply, e.g. incorrect distribution of loads across PSU rails. You might want to have the system examined by a local expert to find potential power distribution issues; this is not really doable remotely. My rule of thumb is that rock-solid operation requires that total nominal power consumption not significantly exceed 60% of the nominal power output of the PSU.
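The 60% rule of thumb is easy to check with a few lines of arithmetic. The sketch below is only an illustration: the 350W figure is the nominal board power of an RTX 3090, but the CPU and "everything else" numbers are placeholders I made up, since the actual configuration is not known.

```python
def load_fraction(total_nominal_w, psu_nominal_w):
    """Fraction of the PSU's nominal output used by the nominal system load."""
    return total_nominal_w / psu_nominal_w


def within_margin(total_nominal_w, psu_nominal_w, max_fraction=0.6):
    """True if the nominal load stays at or below the rule-of-thumb fraction."""
    return load_fraction(total_nominal_w, psu_nominal_w) <= max_fraction


def min_psu_watts(total_nominal_w, max_fraction=0.6):
    """Smallest nominal PSU rating that satisfies the rule of thumb."""
    return total_nominal_w / max_fraction


# Hypothetical configuration: replace with the real component ratings.
components_w = {
    "gpu_0 (RTX 3090)": 350,
    "gpu_1 (RTX 3090)": 350,
    "cpu (assumed)": 280,
    "memory, storage, fans (assumed)": 100,
}

total_w = sum(components_w.values())
print(f"total nominal load: {total_w} W")
print(f"load fraction on a 1500 W PSU: {load_fraction(total_w, 1500):.2f}")
print(f"within 60% margin: {within_margin(total_w, 1500)}")
print(f"PSU rating suggested by the rule: {min_psu_watts(total_w):.0f} W")
```

With these placeholder numbers the load comes to 72% of a 1500W unit, which is why the specifics of the CPU and the rest of the system matter for the answer.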
The reason for this large margin is that modern CPUs and GPUs use dynamic clocking and dynamic voltage levels, and as computational loads change, this can cause rapid and massive changes in instantaneous power draw. The corollary is that there will be short-duration (millisecond range) power spikes that far exceed the TDP (or essentially similar) power ratings associated with these processors. These ratings are based on power draw averaged over longer time frames on the order of minutes, and are primarily important for the design of cooling solutions (TDP = thermal design power).
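To make the difference between averaged and instantaneous power concrete, here is a small sketch using a synthetic power trace sampled once per millisecond. All numbers are invented for illustration (a 350W baseline with 5 ms spikes to 700W every 100 ms); they are not measurements of any real GPU.

```python
def synthetic_trace(baseline_w, spike_w, spike_ms, period_ms, total_ms):
    """Build a power trace with one sample per millisecond.

    Each period of `period_ms` starts with `spike_ms` samples at `spike_w`,
    followed by samples at `baseline_w`.
    """
    return [
        spike_w if (t % period_ms) < spike_ms else baseline_w
        for t in range(total_ms)
    ]


def mean_power(trace):
    """Average power over the whole trace, as an averaged rating would see it."""
    return sum(trace) / len(trace)


def peak_power(trace):
    """Highest instantaneous sample, which is what the PSU has to deliver."""
    return max(trace)


# Made-up example: one second of samples.
trace = synthetic_trace(baseline_w=350, spike_w=700, spike_ms=5,
                        period_ms=100, total_ms=1000)
print(f"mean power: {mean_power(trace):.1f} W")
print(f"peak power: {peak_power(trace)} W")
```

In this synthetic trace the average rises only slightly above the baseline, while the peak is twice the baseline. An averaged rating therefore says little about the spikes the PSU must absorb.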
Cheaply constructed PSUs in particular tend to be overwhelmed by massive power spikes in high-performance systems. This can happen in two ways. In the first, the sudden increase in power draw causes a temporary voltage drop (“brownout”) on the motherboard, which leads to unstable operation of electronic components (higher switching speeds require higher voltage), so the motherboard deasserts the “power good” signal, which causes a reboot. In the second, the PSU itself detects a potentially destructive overload and shuts down.
I do not consider a PSU with an 80 PLUS Silver rating (like the one you linked to) suitable for modern HPC/AI systems. My strong minimum recommendation for an HPC/AI workstation is 80 PLUS Gold, preferably 80 PLUS Platinum. My recommendation for an HPC/AI server is 80 PLUS Platinum, preferably 80 PLUS Titanium. These higher-certified PSUs offer higher efficiency, which means more of the power drawn at the wall outlet (which can be quite limited in the US due to our 110V AC system) is available to the system. These PSUs typically also offer higher reserves, better component and build quality, and longer manufacturer warranties.