I want to process 10 billion records (filtering, sorting, and aggregation) and query specific records within seconds.
Right now I have several choices, e.g. A40, A100, or 4090, and I want to know which one is better for me. I don’t need the third-party tensor support that the A100 offers, so if I only consider CUDA cores and bandwidth, the 4090 seems like the better choice.
Go for the A100, trust me; its CUDA cores are sufficient for highly intensive computation. I have seen both a 4080 and an 80 GB A100 used for deep learning, and even though the 4080 is newer and posts better benchmark results, the A100 was more useful because of its VRAM size. The 4080’s speed may save you 1 or 2 seconds at most, relative to your data size.
A100 is going to run you about $12000, if you can still find one. RTX 4090 will run you about $1800. If you need the 80GB of VRAM, there’s nothing quite like the A100 (except, perhaps, the 48GB L40 if that would be sufficient). The L40 will run you a pretty penny, too, but I don’t think as much as A100. It’s basically a commercial-grade RTX 4090 though with slightly less memory bandwidth.
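To put the VRAM question in numbers, here is a rough sizing sketch. The record widths (4, 8, and 16 bytes per record) are assumptions, since the thread never says how wide a record is. The card capacities are the ones quoted in this thread, plus 48 GB for the A40. It only checks whether the raw data fits, and ignores the scratch space that sorting usually needs.

```python
# Rough check: do 10 billion records fit in VRAM?
# Record widths are hypothetical; the thread does not state them.
NUM_RECORDS = 10_000_000_000
GB = 10**9  # decimal gigabytes, as on spec sheets

# VRAM capacity per card in GB.
VRAM_GB = {"RTX 4090": 24, "A40": 48, "L40": 48, "A100": 80}


def dataset_gb(bytes_per_record):
    """Raw dataset size in GB for a given record width."""
    return NUM_RECORDS * bytes_per_record / GB


def cards_that_fit(bytes_per_record):
    """Names of cards whose VRAM can hold the whole raw dataset at once."""
    size = dataset_gb(bytes_per_record)
    return [name for name, cap in VRAM_GB.items() if cap >= size]


for width in (4, 8, 16):
    fits = cards_that_fit(width) or "none"
    print(f"{width:>2} B/record: {dataset_gb(width):4.0f} GB, fits on: {fits}")
```

Even at 4 bytes per record the raw data comes to 40 GB, so on a 24 GB RTX 4090 it would have to be processed in chunks rather than held in memory at once.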
One thing that people keep overlooking is the L2 cache size. A40: 6MB. Pooh. (And it is about the most bandwidth-starved card in NVIDIA’s history: 700 GB/s, compared to 940 GB/s for its gaming alter ego, the RTX 3090, or 1950 GB/s for the Ampere line’s flagship A100.) And that A100’s got 40MB of L2. Wow. RTX 4090: 72MB. L40: 96MB. Double wow. The L2 cache is the lowest-latency coherent memory resource that the card has, and for “memory bound” FFTs (and perhaps sorting algorithms) the size of that resource is more important than the memory bus itself. So I’d go with the 4090, if the 24GB of memory is enough, or if you’re going to splurge I’d push you towards L40 before A100 or the new(er) H100.
Another thing to consider: I’ve been scrutinizing a workflow that involves an intense sorting algorithm carried out with some of the CUDA libraries. In the limit of many chunks to sort, the A40 comes to about 32 microseconds per chunk (a “chunk” is 53000 atoms, but for the sake of this discussion it’s just a nominal, consistent amount of work). The A100 sorts as fast as 29 microseconds per chunk, so roughly 2.8x the memory bandwidth and nearly 7x the L2 cache space comes to… not much. The RTX 2080 Ti sorts the same data as fast as 36 microseconds per chunk, so an old card that tends to be about 30% slower than the A40 and 70% slower than the A100 on general compute-intensive problems is only about 12-25% slower for my sorting problem. The sorting definitely reaches some limiting performance, whereas if this kind of sorting were L2-bound I’d expect to see throughput drop as more and more chunks overwhelmed the resource. Other types of sorting may be L2-sensitive (it certainly seems like you’d want a global cache to put things in), but the problem may be more one of the speed at which things can get imported to the __shared__ partition of L1 and then rearranged.
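The arithmetic behind that comparison can be checked with a few lines. This is only a sketch that reuses the per-chunk timings and bandwidth figures quoted in this thread; the function and variable names are made up for illustration.

```python
# Sorting time per chunk quoted above, in microseconds.
US_PER_CHUNK = {"RTX 2080 Ti": 36.0, "A40": 32.0, "A100": 29.0}
# Memory bandwidth figures quoted earlier in the thread, in GB/s.
BANDWIDTH_GBS = {"A40": 700.0, "A100": 1950.0}


def slowdown_pct(card, baseline):
    """Percent extra time per chunk that `card` needs compared to `baseline`."""
    return (US_PER_CHUNK[card] / US_PER_CHUNK[baseline] - 1.0) * 100.0


def chunks_per_second(card):
    """Sorting throughput implied by the per-chunk time."""
    return 1e6 / US_PER_CHUNK[card]


bw_ratio = BANDWIDTH_GBS["A100"] / BANDWIDTH_GBS["A40"]
sort_speedup = US_PER_CHUNK["A40"] / US_PER_CHUNK["A100"]

print(f"A100 vs A40: {bw_ratio:.2f}x bandwidth, {sort_speedup:.2f}x sort speed")
for baseline in ("A40", "A100"):
    pct = slowdown_pct("RTX 2080 Ti", baseline)
    print(f"RTX 2080 Ti is {pct:.1f}% slower than {baseline}")
for card in US_PER_CHUNK:
    print(f"{card}: {chunks_per_second(card):,.0f} chunks/s")
```

The gap between the bandwidth ratio and the sort speedup is the point: if this sort were limited by the memory bus, the two ratios should be much closer.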
Others can comment, but this aspect of the cards may have seen only incremental improvements over the past five years. It seems to me a case of GPUs getting bigger (more SMs), not faster: each individual SM still has about the same data bandwidth to and from the global resources, because the bus grows in proportion to the number of SMs, give or take.
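One way to sanity-check the "bigger, not faster" idea is to divide memory bandwidth by SM count. In this sketch the A40, RTX 3090, and A100 bandwidths are the figures quoted in this thread; the SM counts and the RTX 2080 Ti and RTX 4090 bandwidths are assumptions taken from public spec sheets, so treat the output as approximate.

```python
# (SM count, memory bandwidth in GB/s) per card.
# ASSUMPTION: SM counts, and the 2080 Ti / 4090 bandwidths, come from
# public spec sheets, not from this thread.
CARDS = {
    "RTX 2080 Ti": (68, 616.0),
    "RTX 3090": (82, 940.0),
    "A40": (84, 700.0),
    "A100": (108, 1950.0),
    "RTX 4090": (128, 1008.0),
}


def bandwidth_per_sm(card):
    """Memory bandwidth available per SM, in GB/s."""
    sm_count, gb_per_s = CARDS[card]
    return gb_per_s / sm_count


for name in CARDS:
    print(f"{name:<12} {bandwidth_per_sm(name):5.1f} GB/s per SM")
```

Under these assumed figures the GDDR-based cards all land in a similar range per SM across generations, while the HBM-based A100 is the outlier.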
As for linking 4090s: the RTX 4090 has no NVLink connector, so two of them could only cooperate over PCIe. The workstation I just bought has an RTX 4090 and a 4080, which is for testing purposes on individual cards. I had to ask the company specifically for that configuration, against their statement that it would not permit calculations spanning both cards (so, by implication, the company does claim that two 4090s can simultaneously carry out the same calculation).