PyTorch 2.4 and below: 10x slowdown on NVIDIA GB10 (sm_121) due to missing kernel support

UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)

Observed result: significant performance degradation, roughly the 10x slowdown noted in the title.
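To confirm the mismatch the warning describes, it can help to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is only an illustration: `parse_arch` and `capability_supported` are hypothetical helper names, and the min/max comparison mirrors the range printed in the warning text, not necessarily PyTorch's internal check. The only PyTorch calls used are `torch.cuda.is_available()`, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
def parse_arch(arch):
    """Turn an arch string such as 'sm_121' or 'compute_120' into (major, minor)."""
    digits = arch.split("_", 1)[1]
    # Drop any trailing letter suffix (e.g. the 'a' in 'sm_90a').
    digits = "".join(ch for ch in digits if ch.isdigit())
    # The last digit is the minor version; everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def capability_supported(capability, arch_list):
    """Return True if capability lies within the min..max range of arch_list.

    Assumption: this reproduces the (min) - (max) range shown in the
    warning; it is not taken from PyTorch's source.
    """
    caps = [parse_arch(a) for a in arch_list]
    return bool(caps) and min(caps) <= tuple(capability) <= max(caps)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # The helpers above still work without PyTorch installed.

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)   # e.g. (12, 1) on GB10
        archs = torch.cuda.get_arch_list()          # archs this build was compiled for
        print("device capability:", cap)
        print("compiled arch list:", archs)
        print("within supported range:", capability_supported(cap, archs))
```

On the setup from the warning (capability 12.1, supported range 8.0 to 12.0), the last line would be expected to report that the device falls outside the range.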