GPU Selection for custom code (no matrix operations)

Hi everyone,

I’m a retired programmer and I’m porting a personal C++ application to CUDA. It has no matrix operations or graphics; it’s just simple math. I’m structuring the code to minimize branching and otherwise limit CUDA core interdependencies within each SM. The sweet spot is currently 48 threads per SM, which is 77% faster than trying to use all 128 cores. I don’t know enough to tell whether this is due to memory bandwidth or some other factor.
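For reference, the kind of block-size sweep I ran can be sketched roughly like this, timed with CUDA events (the kernel `simple_math_kernel` and its body are just a stand-in for my actual workload):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the "simple math" workload.
__global__ void simple_math_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 24;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep threads-per-block to find the per-SM sweet spot.
    for (int block = 32; block <= 512; block *= 2) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        simple_math_kernel<<<grid, block>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%3d  %.3f ms\n", block, ms);
    }
    cudaFree(d);
    return 0;
}
```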

Regardless, I’m building a rig to run it. I’m choosing two RTX 5090s instead of an RTX 6000 Ada or an L40-class GPU; the A100 and H100 are out of my budget, as this is a personal project.

Question: Can someone confirm that the focus of the professional GPUs is on special features such as tensor cores, etc., and that my use case, being much simpler, would not necessarily benefit enough from those to justify the cost?

Other than experience, is there another way to assess this?

Thanks!

If 48 threads per SM is faster than 128 (and normally you would use a multiple of 128, such as 512 threads per SM, to hide latencies), it is not merely that the speed-up flattens above 48, and you have no direct interdependencies, then you are probably filling up a cache level. L1 cache? Instruction cache (for very long kernels)? Are you using one block with that many threads, or an individual block for each thread?
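One quick way to rule out a hard resource limit (registers or shared memory capping residency) is the occupancy API; a minimal sketch, assuming your kernel is called `simple_math_kernel`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for your actual kernel; only its resource usage matters here.
__global__ void simple_math_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += 1.0f;
}

int main() {
    for (int block = 32; block <= 512; block *= 2) {
        int blocksPerSM = 0;
        // How many blocks of this size can be resident on one SM at once.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, simple_math_kernel, block, /*dynamicSmem=*/0);
        printf("block=%3d -> %2d resident blocks = %4d threads per SM\n",
               block, blocksPerSM, blocksPerSM * block);
    }
    return 0;
}
```

If this reports plenty of resident threads but your measured sweet spot is still 48, the bottleneck is more likely a cache or memory effect, which a profiler like Nsight Compute can pin down.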

The professional GPUs also often have faster double-precision support, typically more memory (with ECC), and sometimes better support for GPU peer access or GPUDirect RDMA with other PCIe cards.
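You can query several of these differences directly from the runtime rather than relying on spec sheets; a small sketch (the FP32:FP64 ratio attribute reports how much slower double precision is on a given card):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        int fp32ToFp64 = 0;
        cudaDeviceGetAttribute(&fp32ToFp64,
            cudaDevAttrSingleToDoublePrecisionPerfRatio, dev);
        printf("%s: %.1f GiB, ECC %s, FP32:FP64 perf ratio %d:1\n",
               p.name, p.totalGlobalMem / double(1 << 30),
               p.ECCEnabled ? "on" : "off", fp32ToFp64);
    }
    return 0;
}
```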

Normally you should first solve the problem of getting more than 48 threads running per SM; then one RTX 5090 is often more than enough, and two are better, of course (but make sure your mainboard supports full electrical x16 lanes for both graphics cards, not just mechanically x16 slots).