FP64 computation on a budget


I am looking for a GPU to speed up a simple, but very large, matrix multiplication:
Basically, we have a matrix A that is 70,000,000 x 15,000 (~10 TB) and want to calculate transpose(A)*A.
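To make the computation concrete: A does not fit in memory, but transpose(A)*A equals the sum of transpose(A_i)*A_i over row blocks A_i, so it can be accumulated one block at a time. Here is a minimal sketch using NumPy on small stand-in sizes. The names `gram_chunked` and `iter_row_blocks` and the block size are made up for illustration; I assume a cupy version would look much the same, with each block copied to the GPU before the multiply.

```python
import numpy as np

def gram_chunked(row_blocks, n_cols, dtype=np.float64):
    """Accumulate transpose(A) @ A from an iterable of row blocks of A."""
    gram = np.zeros((n_cols, n_cols), dtype=dtype)
    for block in row_blocks:
        block = np.asarray(block, dtype=dtype)
        # Each block contributes block^T @ block to the full product.
        gram += block.T @ block
    return gram

def iter_row_blocks(a, block_rows):
    """Yield consecutive row slices of a (stand-in for reading from disk)."""
    for start in range(0, a.shape[0], block_rows):
        yield a[start:start + block_rows]

# Small stand-in for the real 70,000,000 x 15,000 matrix.
rng = np.random.default_rng(0)
a_small = rng.standard_normal((1000, 20))
g = gram_chunked(iter_row_blocks(a_small, block_rows=128), n_cols=20)
```

The result is only 15,000 x 15,000, which is about 1.8 GB in FP64, so only that accumulator plus one row block would need to sit in VRAM at a time.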

I did some simple tests in Python with the cupy library and a 10,000,000 x 15,000 FP32 matrix and found the GPU to be several times faster than the CPU. I know it will depend on the CPU and GPU, but let's just work under the assumption that the GPU will provide a speed-up.

Unfortunately, the data I work with is FP64, which means that the GPUs at my disposal are not a big help.

I was hoping to get some advice on which GPU to buy for FP64 computations. The budget is around $2000-3000.
I guess I could pick the GPU with the highest FP64 GFLOPS within my price range, but how do VRAM and transfer speeds come into play?
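For reference, this is the rough back-of-envelope comparison I have in mind. The product needs about 2 * rows * cols^2 floating-point operations, while every value of A has to be moved to the GPU at least once. The throughput figures below (1 TFLOPS FP64, 10 GB/s transfer) are placeholders I picked for illustration, not measurements of any real card or bus, and the function name `estimate_seconds` is made up.

```python
def estimate_seconds(n_rows, n_cols, fp64_flops_per_s, transfer_bytes_per_s,
                     bytes_per_value=8):
    """Return (compute_seconds, transfer_seconds) for transpose(A) @ A."""
    # A dense (n_cols x n_rows) @ (n_rows x n_cols) product costs
    # roughly 2 * n_rows * n_cols^2 floating-point operations.
    flops = 2.0 * n_rows * n_cols ** 2
    # Every element of A must be sent to the GPU at least once.
    data_bytes = float(n_rows) * n_cols * bytes_per_value
    return flops / fp64_flops_per_s, data_bytes / transfer_bytes_per_s

# Placeholder hardware numbers (assumptions, not real specs).
compute_s, transfer_s = estimate_seconds(
    70_000_000, 15_000,
    fp64_flops_per_s=1e12,      # hypothetical 1 TFLOPS FP64
    transfer_bytes_per_s=10e9,  # hypothetical 10 GB/s host-to-device
)
print(f"compute: {compute_s:.0f} s, transfer: {transfer_s:.0f} s")
```

With these placeholder numbers the compute time is far larger than the transfer time, which is why I suspect FP64 throughput matters most, but I am not sure whether that holds for real hardware.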

Thank you :)