What is the best option to set up an on-premise GPU cluster for a small company?

In our company, there are about 4-5 AI engineers who are considering building an on-premise GPU server to train smaller LLMs (<10B).

We are now deciding between 2 x A100 40GB or, for roughly the same price, about 10 x RTX 4090.
Of these two options, which would you recommend for the long term, and why?

From what I have checked, the RTX 4090 is much faster in terms of clock speed, but I am still not sure how it will perform for training deep learning models.
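
For context, here is the rough memory math I did for full fine-tuning with Adam in mixed precision (just a back-of-envelope sketch in Python, ignoring activations, gradient checkpointing, and any sharding or offloading, so treat it as a lower bound):

```python
# Rough VRAM lower bound for training with Adam in mixed precision:
# ~16 bytes per parameter (fp16/bf16 weights + gradients, plus fp32 master
# weights and two fp32 Adam moment buffers). Activations are NOT included.

def training_vram_gb(params_billion: float) -> float:
    p = params_billion * 1e9
    bytes_per_param = 2 + 2 + 4 + 4 + 4   # weights, grads, master, m, v
    return p * bytes_per_param / 1024**3

for size in (1, 3, 7, 10):                # example model sizes in billions
    print(f"{size}B params: ~{training_vram_gb(size):.0f} GB "
          f"(RTX 4090 = 24 GB, A100 40GB = 40 GB per card)")
```

By that estimate, even a 7B model would not fit on a single card of either type, so we would be sharding across GPUs (or using LoRA/offloading) in both setups, which is part of why I am unsure how much the per-card specs alone matter.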

I would appreciate any suggestions, ideas, experience, and reference links to consider.

Thank you in advance for your responses.

A few thoughts:

For how long do you plan to use this system? Generally speaking, the useful life span for a server might be 5+ years, at 100% utilization, i.e. 24/7 operation. You may want to consider relevant tax laws regarding the depreciation schedule for computers acquired for business purposes in making that decision.

Given the high cost (= high financial risk) of acquiring a high-end server, I personally doubt that building one’s own system is wise. It would probably be best to choose a system from an NVIDIA-approved system integrator. I used to build high-end PCs in the past, but at some point I figured the smarter option was to buy ready-to-run systems from Dell. I have not regretted that step yet, and it has been many years. I pay a bit more, but I do not need to maintain expertise in tricky system configuration issues, and I have peace of mind when it comes to damaging expensive components (I have fried DRAM and hard disk controllers, for example).

The RTX 4090 is a consumer part and as such is not designed for a 100% duty cycle. It may not last for the projected lifetime of the system when operated around the clock. I am not sure that NVIDIA-approved system integrators would configure a server with such a GPU, but it won’t hurt to inquire about it. As best I know, NVIDIA sells and supports the A100 only through system integrators and provides no direct support to individuals who try to roll their own A100-accelerated server.
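
If you do end up running consumer cards around the clock anyway, it is worth at least logging temperature, utilization, and power draw so thermal or throttling problems are noticed early. A minimal monitoring sketch using the NVML Python bindings (the nvidia-ml-py package, assuming it is installed); the same data is also available from nvidia-smi:

```python
# Simple periodic GPU health logger via NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        for i, h in enumerate(handles):
            temp  = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            util  = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # mW -> W
            print(f"GPU{i}: {temp} C, {util}% util, {power:.0f} W")
        time.sleep(60)  # one sample per minute
finally:
    pynvml.nvmlShutdown()
```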

While I agree with njuffa’s comments, a “middle” road might be to consider a solution based on the RTX 6000 48GB.

There are a number of well-written evaluations around what it sounds like you’re trying to achieve here, and I have no connection to the company.
