10*A6000 or 10*A40 for training large language models?

Hello,

My advisor’s research lab is looking to buy a deep learning server, but we’ve been having a hard time choosing between A6000 and A40 GPUs.

Some facts:

  • There are about 10 people in the lab, all working on deep-learning-based research.
  • Server will be installed in a server room maintained by university IT, so ambient temperature / power shouldn’t be an issue.
  • The current configuration allows 10 GPUs in one server.

I’ve looked through a couple of blog posts, and here is what I understand:

  • The A6000 and A40 belong to the same family of GPUs (both are Ampere GA102 cards with 48 GB of GDDR6).
  • A40s have passive cooling, while A6000s have active cooling.
  • A6000s have a higher memory clock (~768 GB/s of bandwidth vs. ~696 GB/s on the A40) => they are slightly faster at inference time.
  • A6000s also have a smaller form factor => we can add more GPUs to the server if needed (?) (see the sketch after this list for one way to verify the specs once the cards are installed).
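For context, here is a minimal sketch of how we’d sanity-check whichever cards we end up with; it assumes a working NVIDIA driver plus a CUDA-enabled PyTorch install, and just prints the name, memory, and SM count of each visible device:

    # Minimal sanity check: print the basic properties of every visible GPU.
    # Assumes PyTorch was installed with CUDA support and the driver is working.
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(
            f"GPU {i}: {props.name}, "
            f"{props.total_memory / 1024**3:.0f} GiB, "
            f"{props.multi_processor_count} SMs, "
            f"compute capability {props.major}.{props.minor}"
        )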

Question:

  • Is passive cooling going to be a problem if the machine runs at full capacity? (The sketch after this list is roughly how we’d plan to monitor for throttling.)
  • Do we risk bottlenecking the A40s / A6000s in any way by putting them all in one big server?
  • Is there any difference in software maintainability between the A40s and the A6000s? Specifically, are we more prone to running into configuration issues on the A6000s than on the A40s?
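In case it’s useful to anyone answering: this is a rough monitoring sketch for the cooling question. It assumes nvidia-smi is on the PATH, and the query field names are taken from the list in nvidia-smi --help-query-gpu (adjust if your driver version differs):

    # Rough monitoring sketch: poll per-GPU temperature, power draw, and the
    # hardware thermal-slowdown flag every 5 seconds while a training job runs.
    # Assumes nvidia-smi is on the PATH; stop with Ctrl-C.
    import subprocess
    import time

    FIELDS = "index,temperature.gpu,power.draw,clocks_throttle_reasons.hw_thermal_slowdown"

    while True:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print(result.stdout.strip())
        time.sleep(5)

For the bottleneck question, my understanding is that nvidia-smi topo -m prints the PCIe/NVLink topology of the box, which should at least show how the 10 cards are attached to the CPUs.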

Apologies for the laundry list of questions!