Hi everyone,
I’m a retired programmer porting a personal C++ application to CUDA. It involves no matrix operations or graphics, just simple math. I’m writing the code to minimize branching and otherwise limit interdependencies between the CUDA cores within each SM. The current sweet spot is 48 threads per SM, which runs 77% faster than trying to keep all 128 cores busy. I don’t know enough to tell whether that is due to memory bandwidth or some other factor.
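For context, this is roughly how I’ve been timing the sweep. It’s a minimal sketch: the kernel body is just a placeholder for my real math, and the block sizes are example values, not a claim about where anyone else’s sweet spot will land.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: a dependent arithmetic chain standing in for my real math.
__global__ void work(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        for (int k = 0; k < 256; ++k)   // busy-work so timing isn't launch-dominated
            x = x * 1.0001f + 0.5f;
        out[i] = x;
    }
}

int main() {
    const int n = 1 << 22;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int blocks[] = {32, 48, 64, 96, 128, 192, 256};
    for (int block : blocks) {
        int grid = (n + block - 1) / block;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        work<<<grid, block>>>(out, in, n);          // warm-up launch
        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)         // average over many launches
            work<<<grid, block>>>(out, in, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%3d  %.3f ms/launch\n", block, ms / 100.0f);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The CUDA events measure GPU-side time only, and averaging over 100 launches washes out per-launch overhead, so the comparison between block sizes stays fair.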
Regardless, I’m building a rig to run it. I’m choosing two RTX 5090s instead of an RTX 6000 Ada or an L40-class GPU; the A100 and H100 are out of my budget since this is a personal project.
Question: Can someone confirm that the professional GPUs are focused on special features such as tensor cores, and that a use case as simple as mine would not benefit enough from those features to matter?
Other than hands-on experience, is there another way to assess this?
Thanks!