First GPU server advice

I run the software engineering lab at a college, and over the past few years, as AI has taken off, so has the number of students working on AI-related final projects (mostly built around TF, but others too). What we currently have is a dozen or so standalone workstations running a mix of single 980 / 1080 Ti cards, which sometimes means doubling up projects on one PC. The department head now wants to buy something scalable: a GPU server, with each project in its own VM that can draw on the available resources rather than being limited to one VM per GPU. But I am a bit overwhelmed and unfamiliar with the host OS options, hypervisors, NVIDIA vGPU licensing…
The server they want is the ASUS ESC8000 G4 (https://www.asus.com/us/Commercial-Servers-Workstations/ESC8000-G4/). This is bare metal, and it (currently) does not support ESXi, which we use for our other servers.
As the person most likely to be responsible for setup and implementation, my goals are OS familiarity, a minimal headache and learning curve, and keeping costs down.
Thanks for bearing with my long post.

Hi

For simplicity, support, stability, management, compatibility, scaling, general information and all the other benefits associated with enterprise-grade hardware, you should be looking at something from one of the big OEMs (Dell, HP, Cisco, etc.) as your core enterprise infrastructure when you're supporting hundreds or thousands of concurrent student workloads 24x7x365.

As you work in EDU, you should be entitled to significant EDU-specific discounts from the OEMs (including NVIDIA), which should make life easier and save you from settling for less appropriate brands. If you purchase your hardware and NVIDIA licenses from the same vendor (Dell, for example), you can negotiate much better discounts.

AI isn't just about the GPU; the whole platform needs to be an end-to-end solution or you will hit performance bottlenecks. Storage and networking are crucial to avoiding these issues. You should be looking at nothing less than all-flash / NVMe storage for your workloads and 10Gb+ networking, depending on where you run the workloads from.
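
If you want a quick sanity check on whether a given volume can keep your GPUs fed, a rough sequential-read benchmark will expose an obvious bottleneck. Here's a minimal sketch in Python; the file path and chunk size are placeholders, and the test file should be much larger than RAM so the OS page cache doesn't flatter the numbers:

```python
import time

PATH = "/data/testfile.bin"   # hypothetical: a large file on the volume under test
CHUNK = 64 * 1024 * 1024      # read in 64 MiB chunks

total = 0
start = time.monotonic()
with open(PATH, "rb") as f:
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
elapsed = time.monotonic() - start

# GB/s is decimal gigabytes; compare against what your training jobs consume per second.
print(f"Read {total / 1e9:.1f} GB in {elapsed:.1f} s -> {total / 1e9 / elapsed:.2f} GB/s")
```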

As for GPUs, you'll want V100 and / or T4 GPUs. As you're working with AI, you'll want Quadro vDWS (QvDWS) licensing, which is licensed per concurrent user, so every concurrent (not named) user will need a QvDWS license.
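
As a worked example of the concurrent-user model (the enrolment and peak-concurrency figures below are illustrative assumptions, not a recommendation):

```python
import math

enrolled_students = 60   # assumption: students with access to the environment
peak_concurrency = 0.5   # assumption: half of them active at the busiest time

# Licenses are counted against peak concurrent sessions, not named accounts.
licenses = math.ceil(enrolled_students * peak_concurrency)
print(f"{licenses} concurrent QvDWS licenses needed at peak")
```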

Depending on your requirements, you may want a single extremely high-powered server (a DGX-1, for example, which has 8x V100s connected over NVLink and some very special software) plus a more granular set of servers to cater for student density. That could be a hyperconverged infrastructure that combines high-performance storage and compute with linear scaling and easy management, something like Cisco HyperFlex combined with multiple T4s per node (rough density numbers are sketched below). Or, if you want individual nodes, then a Dell R740 / HP DL380 / Cisco C240, all with multiple T4s, attached to a high-performance all-flash storage appliance would suffice.
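
To get a feel for student density per node, a back-of-the-envelope calculation based on framebuffer-partitioned vGPU profiles helps. The T4's 16 GB framebuffer is real; the profile size and card count below are assumptions you'd tune to your workloads:

```python
t4_framebuffer_gb = 16    # T4 framebuffer
profile_gb = 4            # assumption: a 4 GB vGPU profile per student
t4s_per_node = 6          # assumption: T4s fitted per node

# vGPU profiles carve the framebuffer into fixed slices, one slice per user.
users_per_gpu = t4_framebuffer_gb // profile_gb
print(f"{users_per_gpu * t4s_per_node} concurrent students per node")
```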

If a DGX-1 is a slightly scary proposition due to its software stack, then you could go for something like the DSS 8440 from Dell EMC, which supports up to 10x V100s; you could simply drop ESXi onto it and then virtualise each of the V100s to provide multi-user support per GPU. With the latest NVIDIA vGPU software, multi-vGPU support is available, so you could attach multiple V100s to a single VM for added performance if required. Local storage on this server is both SSD and NVMe, and the current CPU option is dual Xeon Platinum 8168 (2.7 GHz, 24C/48T). The slight caveat here is that (unlike the DGX-1) this server is PCIe-based and doesn't support NVLink, and it also doesn't currently support more than 1TB of RAM. Cascade Lake Scalable Xeons should be coming for it towards the end of this year / early next year, when hopefully Dell will allow more RAM to be added as well. However, it's a great bit of kit to get you started if you wanted to go down that route, and then for greater user density you can scale out to other infrastructure (as mentioned above) using T4s.
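
Once a student VM has its vGPU(s) attached through ESXi, a quick check from inside the guest confirms TensorFlow can see them. A minimal sketch, assuming TF 2.x in the guest:

```python
import tensorflow as tf

# List the vGPU(s) the hypervisor has exposed to this VM.
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible: {[g.name for g in gpus]}")

# Optional: grow GPU memory on demand so two sessions in one VM can coexist.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```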

With respect (and having personally seen it a few times from customers)… you say you want to keep costs down (just like everyone else), but if you cheap out on the platform to start with, it'll cost you more in the long term through poor performance, limited component support, functionality and scalability (you've already mentioned that ASUS doesn't support ESXi) and, most importantly, a bad user experience.

Speak to the OEMs first, see what EDU discounts are available to you (you should be entitled to some) and go from there.

Regards

MG