FP64 computation on a budget


I am looking for a GPU to speed up a simple, but very large, matrix multiplication:
Basically, we have a matrix A that is 70,000,000 x 15,000 (~10 TB) and want to calculate transpose(A)*A.

I did some simple tests in Python with the cupy library and a 10,000,000x15,000 FP32 matrix and found the GPU to be multiple times faster. I know it will depend on the CPU and GPU, but let’s just work under the assumption that the GPU will provide a speed-up.

Unfortunately, the data I work with is FP64, which means that the GPUs at my disposal are not a big help.

I was hoping to get some advice on which GPU to buy for FP64 computations. The budget is around $2000-3000.
I guess I could pick the GPU with the highest FP64 GFLOPS within my price range, but how do VRAM and transfer speeds come into play?

Thank you :)

As you are probably already aware, relative FP64 throughput can be gauged across the different SM Compute Capabilities here. Unfortunately, it’s only the models whose compute capability ends in “X.0”, or the Tesla class, that perform well.

For your budget, realistically a second-hand V100 is about the only option.
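
On the VRAM and transfer question: transpose(A)*A is the sum of transpose(A_i)*A_i over row blocks A_i of A, so the full matrix never has to sit in GPU memory. Only the 15,000 x 15,000 result (about 1.8 GB in FP64) and one block at a time need to fit. Below is a minimal sketch of that accumulation, not a tuned implementation. The names `gram_blocked`, `iter_row_blocks` and the `xp` parameter are my own and purely illustrative; `xp` is assumed to be either `numpy` or `cupy`, and the block size is something you would pick to fit your card.

```python
import numpy as np


def iter_row_blocks(a, block_rows):
    """Yield consecutive row blocks of a 2-D array (e.g. a NumPy memmap)."""
    for start in range(0, a.shape[0], block_rows):
        yield a[start:start + block_rows]


def gram_blocked(row_blocks, n_cols, xp=np):
    """Accumulate transpose(A) @ A from an iterable of row blocks of A.

    xp is the array module doing the work: numpy on the CPU, or cupy
    (assumed to be installed) to run the products on the GPU.
    """
    gram = xp.zeros((n_cols, n_cols), dtype=xp.float64)
    for block in row_blocks:
        # With xp=cupy this line is the host-to-device copy of one block.
        b = xp.asarray(block, dtype=xp.float64)
        # Each block contributes transpose(A_i) @ A_i to the total.
        gram += b.T @ b
    return gram


if __name__ == "__main__":
    # Small CPU-only check against the direct product.
    rng = np.random.default_rng(0)
    a = rng.standard_normal((1000, 8))
    g = gram_blocked(iter_row_blocks(a, 256), a.shape[1])
    print(np.allclose(g, a.T @ a))
```

To run it on the GPU you would pass `xp=cupy` and feed it blocks read from disk. As a rough estimate, the product needs about 2 x 70,000,000 x 15,000^2 ≈ 3.2e16 floating-point operations, which is on the order of 75 minutes at the V100's roughly 7 TFLOPS FP64 peak. Each byte of A crosses the PCIe bus only once, so the job should be limited by FP64 throughput (or by how fast you can read the data from storage) rather than by VRAM size or the bus, assuming the copies overlap with the compute.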