Which one is more suitable for my needs? A100 or 4090?

yanzhuozhou96 · May 10, 2023, 10:46am

Hi there,

I want to process 10 billion rows of data (filtering, sorting, and aggregation) and query specific data within seconds.

Right now I have several choices, e.g. A40, A100, or 4090, and I want to know which one is better for me. I don’t need the tensor support of the A100, so if I only consider CUDA cores and bandwidth, the 4090 seems to be the better choice.

Could you offer me some advice?

Thx

cbuchner1 · May 10, 2023, 11:03am

My recommendation is to get the GPU with the most VRAM.

striker159 · May 10, 2023, 11:50am

It all depends on how much memory you need at once.

10 billion floats take around 40 GB (about 37 GiB) of memory. Sorting all elements together requires double the memory.
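The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes 4-byte floats and assumes the sort keeps a second buffer the same size as the input, as the post says. The function name `footprint_bytes` is made up for illustration. The "37" figure corresponds to binary gigabytes (GiB); in decimal gigabytes the same data is 40 GB.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gigabyte (gibibyte)


def footprint_bytes(n_elements, bytes_per_element=4, out_of_place_sort=False):
    """Rough device-memory estimate for n_elements values.

    Assumption: an out-of-place sort needs a scratch buffer as large
    as the input, so the footprint doubles.
    """
    base = n_elements * bytes_per_element
    return 2 * base if out_of_place_sort else base


n = 10_000_000_000  # 10 billion floats
data = footprint_bytes(n)
with_sort = footprint_bytes(n, out_of_place_sort=True)

print(f"data only:         {data / GB:.1f} GB ({data / GIB:.2f} GiB)")
print(f"with sort scratch: {with_sort / GB:.1f} GB ({with_sort / GIB:.2f} GiB)")
```

Under these assumptions, sorting all 10 billion floats in one pass would not fit on a 24 GB card, and would be tight even on an 80 GB card.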

The 4090 is faster than the A100, but has 24 GB instead of 80 GB.

yanzhuozhou96 · May 11, 2023, 2:16am

If I assign 2 billion database rows to each node, would 24 GB be sufficient?

striker159 · May 11, 2023, 7:23am

I cannot decide that for you.
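Whether 24 GB is enough for 2 billion rows depends on how wide each row is, which the thread does not say. A back-of-the-envelope way to frame it is to ask how many bytes per row the card can afford. The sketch below is hypothetical: the helper `max_bytes_per_row`, the `reserve_fraction` parameter, and the choice to treat 24 GB as 24 GiB are all assumptions, and it reuses the earlier rule of thumb that sorting doubles the memory needed.

```python
def max_bytes_per_row(vram_bytes, n_rows, sort_scratch=True, reserve_fraction=0.0):
    """Largest row width (in bytes) that fits n_rows in vram_bytes.

    Assumptions: reserve_fraction of the memory is set aside for other
    allocations, and a sort needs a scratch copy the size of the data.
    """
    usable = vram_bytes * (1.0 - reserve_fraction)
    if sort_scratch:
        usable /= 2  # half for the data, half for the sort's scratch buffer
    return usable / n_rows


vram = 24 * 2**30        # assumed: a "24 GB" card exposing 24 GiB
rows = 2_000_000_000     # 2 billion rows per node, from the post above

print(f"no sort scratch:   {max_bytes_per_row(vram, rows, sort_scratch=False):.2f} bytes/row")
print(f"with sort scratch: {max_bytes_per_row(vram, rows, sort_scratch=True):.2f} bytes/row")
```

With these numbers the budget is roughly 12 bytes per row without sort scratch space and roughly 6 bytes per row with it, so anything wider than a single small key column would need to be processed in batches.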

bilalomar729 · November 11, 2023, 8:05am

Go for the A100, trust me; its CUDA cores are sufficient for highly intensive computations. I have seen both a 4080 and an A100 80 GB used for deep learning: even though the 4080 is newer and gives better benchmark results, the A100 was more useful due to its VRAM size. On the 4080, the extra speed may save you 1 or 2 seconds at most, depending on your data size.

dscerutti · November 11, 2023, 9:54pm

A100 is going to run you about $12000, if you can still find one. RTX 4090 will run you about $1800. If you need the 80GB of VRAM, there’s nothing quite like the A100 (except, perhaps, the 48GB L40 if that would be sufficient). The L40 will run you a pretty penny, too, but I don’t think as much as A100. It’s basically a commercial-grade RTX 4090 though with slightly less memory bandwidth.

One thing that people keep overlooking is the L2 cache size. A40: 6MB. Pooh. (And it is about the most bandwidth-starved card in NVIDIA’s history: 700 GB/s compared to its gaming alter-ego, the RTX 3090, at 940 GB/s or the Ampere line’s flagship A100 at 1950 GB/s.) And that A100’s got 40MB of L2. Wow. RTX 4090: 72MB. L40: 96MB. Double wow. The L2 cache is the lowest-latency coherent memory resource that the card has, and for “memory bound” FFTs (and perhaps sorting algorithms) the size of that resource is more important than the memory bus itself. So I’d go with the 4090, if the 24GB of memory is enough, or if you’re going to splurge I’d push you towards L40 before A100 or the new(er) H100.

Another thing to consider is that I’ve been scrutinizing a workflow that involves an intense sorting algorithm carried out with some of the CUDA libraries. In the limit of a lot of chunks to sort, the A40 sorts at about 32 microseconds per chunk (a “chunk” is 53000 atoms, but for the sake of this discussion it’s just a nominal, consistent amount of work). A100 sorts as fast as 29 microseconds per chunk, so 2.7x the memory bandwidth and nearly 7x the global cache space comes to… not much. RTX 2080Ti sorts the same data as fast as 36 microseconds per chunk, so an old card that tends to be about 30% slower than the A40 and 70% slower than the A100 on general compute-intense problems is only about 12-25% slower for my sorting problem. The sorting definitely reaches some limiting performance, whereas if this kind of sorting were L2-bound I’d expect to see throughput drop as more and more chunks overwhelmed the resource. Other types of sorting may be L2-sensitive (it certainly seems like you’d want a global cache to put things in), but the problem may be more one of the speed at which things can get imported to the __shared__ partition of L1 and then rearranged.
Others can comment, but this aspect of the cards may have had relatively incremental improvements over the past five years. It seems to me a case of GPUs getting bigger (more SMs), not faster: each individual SM still has about the same data bandwidth to and from the global resources, and the bus grows in proportion to the number of SMs, give or take.
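For reference, the relative slowdowns quoted above follow directly from the per-chunk timings. The snippet below is only a worked check of that arithmetic using the numbers from the post (32, 29 and 36 microseconds per chunk; 700 and 1950 GB/s); the helper name `percent_slower` is made up for illustration.

```python
# Per-chunk sort times in microseconds, as quoted in the post above.
sort_us_per_chunk = {"A40": 32.0, "A100": 29.0, "RTX 2080 Ti": 36.0}


def percent_slower(t_slow, t_fast):
    """How much longer t_slow takes than t_fast, in percent."""
    return (t_slow / t_fast - 1.0) * 100.0


vs_a40 = percent_slower(sort_us_per_chunk["RTX 2080 Ti"], sort_us_per_chunk["A40"])
vs_a100 = percent_slower(sort_us_per_chunk["RTX 2080 Ti"], sort_us_per_chunk["A100"])
a40_vs_a100 = percent_slower(sort_us_per_chunk["A40"], sort_us_per_chunk["A100"])

# Memory bandwidth ratio A100 / A40, from the figures quoted above (GB/s).
bandwidth_ratio = 1950.0 / 700.0

print(f"2080 Ti vs A40:  {vs_a40:.1f}% slower")
print(f"2080 Ti vs A100: {vs_a100:.1f}% slower")
print(f"A40 vs A100:     {a40_vs_a100:.1f}% slower")
print(f"A100/A40 memory bandwidth: {bandwidth_ratio:.1f}x")
```

The 2080 Ti comes out roughly 12% to 24% slower, and the A100's advantage over the A40 is only about 10% despite having close to three times the memory bandwidth, which is the point the post is making.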

njuffa · November 11, 2023, 10:05pm

I haven’t been following this closely, but I was under the impression that H100 is still pretty much unobtainium for retail-level purchasers?

dscerutti · November 11, 2023, 10:17pm

Unobtainium, indeed. $25-30k per card. And a PSU that can feed a server full of the 700W cards. Osmium-plated grills, 24k gold rack doors.

astokes · November 17, 2023, 12:22am

Are they linkable, i.e. would 4 of them work better than an A100?
Any good for deep learning, or do you need the tensor cores?

dscerutti · November 17, 2023, 1:21am

RTX 4090 has tensor cores. I’m not sure whether the tensor support in A100 is more robust or efficient, but one RTX 4090 has more tensor cores (512), of some type, than A100 (432).

As for linking 4090s, I think that they can be linked by NVLink. The workstation I just bought has an RTX 4090 and a 4080, which is for testing purposes on individual cards. I had to ask the company specifically for that configuration, against their statement that it would not permit calculations spanning both cards (so, by implication, the company does claim that two 4090s can simultaneously carry out the same calculation).

rs277 · November 17, 2023, 2:04am

It looks like Ampere was the end of NVLink on consumer cards:

rs277 · January 29, 2024, 6:52pm

Supplementary info:
