Re PSU sizing. The nominal wattage for electronic components is typically stated as TDP (thermal design power) or something essentially equivalent. This is power draw averaged over long periods of time (across several minutes). It is needed for determining the correct dimensions of the thermal solutions, thus the name. TDP doesn’t tell us anything about instantaneous power requirements (across a few milliseconds), which with modern CPUs and GPUs can be significantly higher than nominal. In my experience, a large safety factor is therefore needed for rock solid 24/7/365 operation under possibly changing environmental factors such as ambient temperature, so my standing recommendation is to make sure that nominal power for the system does not significantly exceed 60% of the nominal PSU wattage.
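As a rough sketch of that rule of thumb (the component wattages below are hypothetical placeholders, not measurements from your system):

```shell
# Rule of thumb: total nominal (TDP-based) system power should not
# significantly exceed 60% of the PSU's nominal wattage.
# Hypothetical example: four 280 W GPUs plus ~250 W for CPU, board, drives.
nominal_w=$(( 4 * 280 + 250 ))     # 1370 W nominal system draw
psu_w=2000
limit_w=$(( psu_w * 60 / 100 ))    # 1200 W = 60% of a 2000 W PSU
echo "nominal ${nominal_w} W vs. 60% limit ${limit_w} W"
```

With these example numbers, 1370 W exceeds the 1200 W limit, so by this rule even a 2000 W PSU would be marginal for such a load.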
I usually suggest 80PLUS Titanium for servers with large power draw because it
(1) gives the greatest power supply “head room”, which can be important in environments where power is limited by the amperage of the circuit breaker. E.g. a typical residential electrical outlet in the US cannot supply more than 15A @ 120V. The more efficient the PSU, the higher the percentage of that power actually available to the system. Are you running the 2000W Super Flower PSU off 230V mains?
(2) converts the least amount of power to useless heat before it ever gets to the system, reducing electricity costs. With a system like yours operating around the clock, you are looking at 50kWh per day, so assuming 90% PSU efficiency, the intra-PSU loss is 5kWh per day, or (depending on electrical tariff) about $1.20 to $1.50 per day where I live.
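The back-of-the-envelope arithmetic, with the tariff as an assumed placeholder:

```shell
# 50 kWh/day drawn at the wall; at 90% PSU efficiency, 10% is converted
# to heat inside the PSU. The $0.27/kWh tariff is assumed for illustration.
awk 'BEGIN {
    wall_kwh = 50; eff = 0.90; tariff = 0.27
    loss_kwh = wall_kwh * (1 - eff)
    printf "PSU loss: %.1f kWh/day, about $%.2f/day\n", loss_kwh, loss_kwh * tariff
}'
```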
Re symptoms: The “GPU freeze” scenario is not clear yet. Here is a hypothetical scenario: You boot the system, run nvidia-smi, and it happily reports that it can see four GPUs, all idling. Then you start your deep learning software, and some minutes into that the GPUs stop making forward progress. At this point you run nvidia-smi again, and it cannot see any of the GPUs. You check /var/log/messages and see multiple error messages like NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus. Is that what you are observing?
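A quick way to check for this yourself (the log path varies by distribution; on systemd-based systems the kernel journal works everywhere):

```shell
# Look for NVIDIA Xid errors in the kernel log; Xid 79 means
# "GPU has fallen off the bus". On RHEL-style systems the file is
# /var/log/messages, on Debian-style systems /var/log/syslog.
journalctl -k | grep -i "NVRM: Xid" || echo "no Xid errors logged"
```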
If so, the most likely root cause is inadequate power supply to the GPUs. There are only a few scenarios in which a GPU can fall off the bus: (1) (poor quality) riser cards degrading PCIe signal quality, (2) the GPU not waking up correctly after a suspend-resume cycle, (3) PCIe link operation negatively impacted by ACPI, (4) a defective GPU, (5) insufficient power supplied to the GPU. All but the last one are rare.
Re “Lastly, I have the same server with different GPUs (four Titan RTX) and coolers (liquid cooling), and it works well under full load”:
If both machines were configured identically in all aspects, I would agree that both should work when one works. “In all aspects” means “copy exactly”, down to cabling, BIOS versions, etc. Clearly, you do not have that here, as you have different GPUs in the two systems. They may have different requirements, e.g. for power and PCIe address space. One thing you can try is a cross check: exchange the GPUs between the working system and the problematic system and see what happens. I would also suggest you start with a small configuration with one GPU (which hopefully works) and work your way up to a four-GPU configuration.
You would want to work very methodically, carefully noting whether any issues correlate with a particular GPU, PCIe slot, or system.
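For that bookkeeping, it helps to snapshot which physical card sits in which slot before and after each swap; nvidia-smi can print that directly via --query-gpu. The sketch below parses a hypothetical sample of its CSV output (the bus IDs and UUIDs are made up for illustration):

```shell
# Hypothetical sample output of:
#   nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv,noheader
snapshot='0, 00000000:06:00.0, GPU-11111111-2222-3333-4444-555555555555
1, 00000000:07:00.0, GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'
# Which card is currently in the slot that reported Xid 79 (bus 06:00)?
uuid=$(printf '%s\n' "$snapshot" | awk -F', ' '$2 ~ /06:00/ { print $3 }')
echo "card in slot 06:00: ${uuid}"
```

If, after swapping, the failure follows the UUID, suspect the card; if it stays with the slot, suspect the slot, riser, or power cabling.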