Does anybody have experience running a multi-GPU workstation (likely 4 GPUs) with deep learning workloads continuously, 24/7, using non-Tesla GPUs (Volta or Turing architecture) such as the “Titan V” or “Quadro RTX 5000/6000”? Are these GPUs suitable for 24/7 processing? Any thoughts?
Although I cannot find anything on the NVIDIA website right now, several of NVIDIA’s partners advertise Quadros as supporting 24/7 operation. For example, PNY says:
I have been using Quadros of various kinds for many years, typically operating them close to a 24/7 compute load profile, and have never encountered any issues. I will note that I did not use any particular Quadro for more than three years before replacing it, that these were at most dual GPU machines, and that I have not used the new RTX GPUs at all.
From many posts in these forums one can see that home-built high-density GPU systems can run into issues with system BIOS support, cooling, and power supply, although these were often systems with more than four GPUs.
Rule of thumb for rock-solid power supply: nominal wattage of all system components combined <= 60% of nominal rating of the PSU (there are short-duration power spikes in both CPUs and GPUs, and some CPU vendors understate the power draw of their CPUs under full load). I would recommend an 80 PLUS Platinum compliant PSU. With four GPUs cranking 24/7 (and a nominal system power draw of about 1300 to 1500W, depending on the GPU used), you might want to check whether a more expensive 80 PLUS Titanium compliant PSU would pay for itself in reduced electricity costs within, say, two years.
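To make the "would Titanium pay for itself" check concrete, here is a rough sketch of the arithmetic. The efficiency figures (92% for Platinum, 94% for Titanium at around half load), the electricity price, and the price premium are all assumptions for illustration; plug in the numbers from the actual PSU data sheets and your utility bill.

```python
HOURS_PER_YEAR = 24 * 365  # continuous 24/7 operation


def annual_energy_cost(dc_load_w, efficiency, price_per_kwh):
    """Yearly electricity cost for a PSU delivering dc_load_w to the system."""
    wall_power_w = dc_load_w / efficiency  # power drawn from the outlet
    return wall_power_w / 1000.0 * HOURS_PER_YEAR * price_per_kwh


def payback_years(dc_load_w, eff_base, eff_better, price_per_kwh, price_premium):
    """Years until the more efficient PSU recovers its extra purchase price."""
    saving = (annual_energy_cost(dc_load_w, eff_base, price_per_kwh)
              - annual_energy_cost(dc_load_w, eff_better, price_per_kwh))
    if saving <= 0:
        return float("inf")  # the "better" PSU never pays for itself
    return price_premium / saving


# Assumed example: 1400 W system load, $0.15/kWh, $100 Titanium premium.
years = payback_years(1400, eff_base=0.92, eff_better=0.94,
                      price_per_kwh=0.15, price_premium=100.0)
print(f"Payback time: {years:.1f} years")
```

With these assumed inputs the payback lands somewhat above two years; a higher electricity price or a smaller price premium shortens it.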
Other rules of thumb: 4 CPU cores per GPU, system memory 4x total GPU memory, NVMe mass storage.
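As a quick illustration of those two sizing rules, a minimal sketch follows. The function name and the example configuration (four GPUs with 24 GB each, as on a Quadro RTX 6000) are mine, chosen only for illustration.

```python
def recommended_host_config(num_gpus, gpu_memory_gb, cores_per_gpu=4, mem_factor=4):
    """Apply the rules of thumb: 4 CPU cores per GPU, system RAM = 4x total GPU RAM."""
    total_gpu_memory_gb = num_gpus * gpu_memory_gb
    return {
        "cpu_cores": cores_per_gpu * num_gpus,
        "system_memory_gb": mem_factor * total_gpu_memory_gb,
    }


# Assumed example: 4 GPUs with 24 GB each.
print(recommended_host_config(num_gpus=4, gpu_memory_gb=24))
```

For that example the rules suggest 16 CPU cores and 384 GB of system memory.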
While I haven’t used the Titan V specifically, we have regularly used the older Titans (and the higher-end GTX models, and now RTX) in single, dual and quad GPU systems running solidly for multiple days at a time. Generally we have not had any stability issues, aside from the occasional card that dies within a few days of purchase, but that is pretty much standard irrespective of workload. Obviously, mileage may vary.
We do make sure to have adequate cooling (preferring the exhaust style cards) as well as an over-the-top PSU just to make sure of stability as njuffa mentioned. We tend to go for 1500/1600W. Running any PC hard for multiple days at a time requires a stable system!
The other thing to keep in mind is that the code we run on them (ptychography, digital image reconstruction) is inherently robust against bit errors that could develop in memory. The more GPUs you have and the longer you run, the more likely you are to need ECC memory.
In a regular office environment, 15A circuit breakers are typical in the US, which puts the high end of typical power supplies for workstation-class PCs at 1600W.
Note, however, that the Quadro RTX 6000 is specified with a power draw of 295W (https://www.nvidia.com/en-us/design-visualization/quadro/rtx-6000/). So if you put four of those in a box, a 1600W PSU is not going to cut it. Not even close. Add another 120W-130W for the CPU, 0.4W per GB of DDR4-2666, 6W per mass storage device, 10W for miscellaneous motherboard components (chipsets, network interfaces, etc). Assuming you are not skimping on system memory (a fairly common mistake in GPU-accelerated systems!), that will easily put the nominal power draw of the machine at around 1500W.
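The power budget above can be added up as follows. This is just a sketch using the per-component estimates from this post; the 384 GB of system memory and the two storage devices in the example are assumed values, not a recommendation for a specific build.

```python
def nominal_power_draw_w(num_gpus, gpu_w, cpu_w, system_memory_gb,
                         num_storage_devices,
                         memory_w_per_gb=0.4,   # DDR4-2666 estimate from the post
                         storage_w_each=6.0,    # per mass storage device
                         misc_w=10.0):          # chipsets, network interfaces, etc.
    """Sum the nominal power draw of all system components, in watts."""
    return (num_gpus * gpu_w
            + cpu_w
            + system_memory_gb * memory_w_per_gb
            + num_storage_devices * storage_w_each
            + misc_w)


# Assumed example: 4 x Quadro RTX 6000 (295 W each), 130 W CPU,
# 384 GB of DDR4, two storage devices.
total = nominal_power_draw_w(num_gpus=4, gpu_w=295, cpu_w=130,
                             system_memory_gb=384, num_storage_devices=2)
print(f"Nominal system power draw: {total:.0f} W")
```

The GPUs alone account for 1180W here, and the generously sized system memory adds another 150W or so, which is how the total ends up close to 1500W.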
But now you need to add a PSU safety factor if you want continuous rock-solid operation. In my experience, the most common computer component to break is the PSU, followed by DRAM. The components in a PSU age under continuous load, and this happens much faster if the PSU is running hot, which is why you want maximum efficiency and load restrictions. I have managed to kill PSUs in as little as half a year by subjecting them to 80% load almost continuously. Admittedly my rule of thumb of 60% load on the PSU may seem excessively low to some, but if reliability is the goal, I would not want to deviate much from it (added benefit: this also exercises the PSU in its most efficient range, which is typically from 20% to 65% load).
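Applied to numbers, the 60% rule of thumb works out like this. A minimal sketch; the 1500W nominal draw and the 1600W PSU are the example figures from this thread, and the helper names are made up for illustration.

```python
def min_psu_rating_w(nominal_draw_w, max_load_fraction=0.60):
    """Smallest PSU rating that keeps the nominal draw at or below the load limit."""
    return nominal_draw_w / max_load_fraction


def psu_load_fraction(nominal_draw_w, psu_rating_w):
    """Fraction of the PSU's nominal rating used by the system."""
    return nominal_draw_w / psu_rating_w


nominal_draw_w = 1500  # example figure for a four-GPU system
print(f"Minimum PSU rating at 60% load: {min_psu_rating_w(nominal_draw_w):.0f} W")
print(f"Load on a 1600 W PSU: {psu_load_fraction(nominal_draw_w, 1600):.0%}")
```

In other words, a 1500W system would want roughly 2500W of PSU capacity under this rule, while a single 1600W unit would be running well above the 80% load level that I have seen kill PSUs quickly.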
I am not particularly timid when it comes to slapping together PC components, but for a costly behemoth such as the system envisioned by the OP, I would prefer to purchase a completely configured system from an experienced system integrator (check NVIDIA’s official partner list) who provides a warranty on the complete system, to make sure everything is set up properly.