What is the best option to set up an on-premise GPU cluster for a small company?

In our company, there are about 4-5 AI engineers who are considering building an on-premise GPU server to train smaller LLMs (<10B).

We are now deciding between 2 x A100 40GB or, for roughly the same price, about 10 x RTX 4090.
Of these two options, which would you recommend for the long term, and why?

From what I have checked, the RTX 4090 is much faster in terms of clock speed, but I am still not sure how it will perform for training deep learning models.
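
For context, here is the rough memory math I did for full fine-tuning with Adam in mixed precision (just a back-of-envelope sketch in Python, ignoring activations, gradient checkpointing, and any sharding or offloading, so treat it as a lower bound):

```python
# Rough VRAM lower bound for training with Adam in mixed precision:
# ~16 bytes per parameter (fp16/bf16 weights + gradients, plus fp32 master
# weights and two fp32 Adam moment buffers). Activations are NOT included.

def training_vram_gb(params_billion: float) -> float:
    p = params_billion * 1e9
    bytes_per_param = 2 + 2 + 4 + 4 + 4   # weights, grads, master, m, v
    return p * bytes_per_param / 1024**3

for size in (1, 3, 7, 10):                # example model sizes in billions
    print(f"{size}B params: ~{training_vram_gb(size):.0f} GB "
          f"(RTX 4090 = 24 GB, A100 40GB = 40 GB per card)")
```

By that estimate, even a 7B model would not fit on a single card of either type, so we would be sharding across GPUs (or using LoRA/offloading) in both setups, which is part of why I am unsure how much the per-card specs alone matter.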

I would appreciate any suggestions, ideas, experience, and reference links to consider.

Thank you in advance for your responses.

A few thoughts:

For how long do you plan to use this system? Generally speaking, the useful life span for a server might be 5+ years, at 100% utilization, i.e. 24/7 operation. You may want to consider relevant tax laws regarding the depreciation schedule for computers acquired for business purposes in making that decision.

Given the high cost (= high financial risk) of acquiring a high-end server, I personally doubt that building one’s own system is wise. It would probably be best to choose a system from an NVIDIA-approved system integrator. I used to build high-end PCs in the past, but at some point I figured the smarter option was to buy ready-to-run systems from Dell. I have not regretted that step yet, and it has been many years. I pay a bit more, but I do not need to maintain expertise in tricky system configuration issues, and I have peace of mind when it comes to damaging expensive components (I have fried DRAM and hard disk controllers, for example).

The RTX 4090 is a consumer part and as such is not designed for a 100% duty cycle. It may not last for the projected lifetime of the system when operated around the clock. I am not sure that NVIDIA-approved system integrators would configure a server with such a GPU, but it won’t hurt to inquire about it. As best I know, NVIDIA sells and supports the A100 only through system integrators and provides no direct support to individuals who try to roll their own A100-accelerated server.
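
If you do end up running consumer cards around the clock anyway, it is worth at least logging temperature, utilization, and power draw so thermal or throttling problems are noticed early. A minimal monitoring sketch using the NVML Python bindings (the nvidia-ml-py package, assuming it is installed); the same data is also available from nvidia-smi:

```python
# Simple periodic GPU health logger via NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        for i, h in enumerate(handles):
            temp  = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            util  = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # mW -> W
            print(f"GPU{i}: {temp} C, {util}% util, {power:.0f} W")
        time.sleep(60)  # one sample per minute
finally:
    pynvml.nvmlShutdown()
```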

While I agree with njuffa’s comments, a “middle” road might be to consider a solution based on the RTX 6000 48GB.

There are a number of well-written evaluations around what it sounds like you’re trying to achieve here, and I have no connection to the company.
