Wrong answer with mma.sync.aligned.m8n8k4

Robert_Crovella · April 3, 2023, 3:28pm

The product itself contributes the same value to every output location: 0.8 x 0.7 x 4 (because k=4). Once the rounding involved in doing that calculation in fp16 is taken into account, that value is 2.24023.
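The 2.24023 figure can be reproduced on the host. This is a minimal Python sketch, not a model of the tensor core's actual datapath: it assumes the inputs 0.8 and 0.7 are rounded to fp16 first and that the k=4 sum is rounded to fp16 once at the end. `to_fp16` is a hypothetical helper built on the standard `struct` module's half-precision format `e`.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a = to_fp16(0.8)   # 0.8 is not exactly representable in fp16
b = to_fp16(0.7)   # neither is 0.7
K = 4              # k dimension of m8n8k4

# Assumption: the K products are summed exactly, then rounded to fp16 once.
dot = to_fp16(sum(a * b for _ in range(K)))

print(a, b, dot)   # 0.7998046875 0.7001953125 2.240234375
```

The printed 2.240234375 is the value quoted above as 2.24023.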

At each step, you are adding that value to the running sum of the previous iterations. As that running sum gets large (relative to what can be represented in fp16), an addition such as 8192+2.24023 doesn’t give you 8194.24023 as you might expect. At that magnitude adjacent fp16 values are 8 apart, so the nearest representable results are 8192 and 8200, and the sum comes back as 8192.
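The stall can be seen with a short host-side simulation. This is a sketch under stated assumptions rather than a reproduction of the original kernel: the accumulator is rounded to fp16 (round-to-nearest) after every addition, the per-iteration value is the 2.240234375 discussed above, and `to_fp16` and `accumulate` are hypothetical helper names.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

STEP = 2.240234375  # per-iteration product, already an fp16 value

def accumulate(iterations):
    """Add STEP repeatedly, rounding the running sum to fp16 each time."""
    acc = 0.0
    for _ in range(iterations):
        acc = to_fp16(acc + STEP)
    return acc

print(to_fp16(8192.0 + STEP))  # 8192.0: the addend is rounded away
print(accumulate(10000))       # 8192.0: the sum stops growing here
```

Under these assumptions the running sum never gets past 8192, no matter how many more iterations are run.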

This problem is due to the limited number of bits in the mantissa/significand of any modern “floating point” number representation. How far apart in magnitude two numbers can be and still be added usefully depends on the accuracy you expect.
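To put numbers on that, the gap between adjacent fp16 values can be computed from the exponent alone. A minimal sketch, assuming IEEE 754 half precision (10 stored significand bits plus 1 implicit bit) and values in the normal range; `fp16_spacing` is a hypothetical helper name.

```python
import math

def fp16_spacing(x):
    """Gap between adjacent fp16 values near x (normal range only)."""
    _, e = math.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 11)  # 11 significand bits: 1 implicit + 10 stored

for x in (2.24023, 2048.0, 4096.0, 8192.0):
    print(x, fp16_spacing(x))
# 2.24023 0.001953125
# 2048.0 2.0
# 4096.0 4.0
# 8192.0 8.0
```

With round-to-nearest, an addend smaller than half the gap is dropped entirely. At 8192 the gap is 8, and 2.24023 is below 4, so the sum cannot move.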
