2D FFTs (real or complex, float32) give slightly different results on a GTX 1080 Ti (Pascal) than on a V100 (Volta) or an A40 or A100 (Ampere) under CUDA 11. The arrays range from 1024x1024 to 4096x4096. With the cuFFT that shipped with CUDA 9.2.88p1, we saw exactly the same results on Pascal and Volta. Is this expected, and is there some setting that can be used to get exactly the same results across all three architectures?
Identical results are not guaranteed across architectures, because each architecture has its own implementations and optimizations, and these can change when the software stack is updated.
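One general reason two implementations of the same computation can disagree in the low-order bits is that floating-point addition is not associative, so a change in the order of operations can change the rounded result. The following is a minimal Python sketch of that effect only. It does not reproduce any cuFFT kernel, and the helper names (`f32`, `sum_sequential`, `sum_pairwise`) are made up for illustration.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_sequential(values):
    """Add left to right, rounding to float32 after every addition."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


def sum_pairwise(values):
    """Add by recursive halving, rounding to float32 after every addition."""
    if len(values) == 1:
        return f32(values[0])
    mid = len(values) // 2
    return f32(sum_pairwise(values[:mid]) + sum_pairwise(values[mid:]))


# Same four inputs, two summation orders. The exact sum is 2.0.
values = [1e8, 1.0, -1e8, 1.0]
print(sum_sequential(values))  # 1.0
print(sum_pairwise(values))    # 0.0
```

Both orders are valid float32 summations of the same data, yet they return different values. Real FFT outputs typically differ by far less than this deliberately extreme case.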
Yes, of course, yet we did get binary-identical results for Pascal and Volta, and now Volta and Ampere, suggesting it might be possible for all three to get the same results…
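For anyone who wants to quantify "binary-identical" versus "slightly different", here is a small Python sketch that compares two sequences of float32 values bit for bit and reports the largest difference in units in the last place (ULPs). It assumes the FFT outputs from each GPU have already been copied back to the host and flattened into equal-length lists of real numbers (real and imaginary parts listed separately) with no NaNs. The function names are hypothetical and nothing here calls cuFFT.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ulp_distance(a: float, b: float) -> int:
    """Count float32 steps between a and b (+0.0 and -0.0 count as 0 apart)."""
    def ordered(bits: int) -> int:
        # Map the sign-magnitude bit pattern onto a monotonic integer line.
        return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits
    return abs(ordered(f32_bits(a)) - ordered(f32_bits(b)))


def compare_float32(ref, test):
    """Return (bit_identical, max_ulp_difference) for two equal-length sequences."""
    identical = all(f32_bits(r) == f32_bits(t) for r, t in zip(ref, test))
    max_ulp = max(ulp_distance(r, t) for r, t in zip(ref, test))
    return identical, max_ulp


# Hypothetical data: the second value differs by one float32 step.
reference = [1.0, 2.0, 3.0]
candidate = [1.0, 2.0 + 2.0 ** -22, 3.0]
print(compare_float32(reference, reference))  # (True, 0)
print(compare_float32(reference, candidate))  # (False, 1)
```

A result of `(True, 0)` corresponds to the binary-identical case described above. A small nonzero ULP count is the kind of difference that comes from rounding rather than from a wrong answer.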
It’s possible that the Volta+ kernels were updated in CUDA Toolkit (CTK) 11.
You are likely right about that. I’ll ask via a bug/problem report. Thanks for your quick response!