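Hello, I am experimenting with an H200 to get familiar with it.
I think I have found some bugs in FP8 matrix multiplication (torch._scaled_mm) on the H200. Can anyone explain the reason?
As far as I know, the result should be 8.0 rather than 5.5 for bug 1, and 8.0 rather than 4.0 for bug 2. In both cases every element is 2**(-9), so each product is 2**(-18), and summing 2**21 of them gives 2**3 = 8.0. The 2**4 extra terms in bug 2 add only 2**(-14), which still rounds to 8.0 in float8_e4m3fn.
Thanks for your help.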
<Bug 1 Code details>
import torch

# Every element is 2**(-9), which float8_e4m3fn represents exactly.
a = torch.tensor([1.0 * 2**(-9)] * 0 + [1.0 * 2**(-9)] * (2**21), dtype=torch.float32, device='cuda')
b = torch.tensor([[1.0 * 2**(-9)] * 0 + [1.0 * 2**(-9)] * (2**21)] * 16, dtype=torch.float32, device='cuda')
a = a.to(torch.float8_e4m3fn)
a = a.unsqueeze(0)  # shape (1, 2**21)
b = b.to(torch.float8_e4m3fn)  # shape (16, 2**21)
result = torch._scaled_mm(
    input=a,
    mat2=b.t(),  # shape (2**21, 16)
    scale_a=torch.tensor([1.0], dtype=torch.float32, device='cuda'),
    scale_b=torch.tensor([1.0], dtype=torch.float32, device='cuda'),
    # bias=
    # scale_result=
    # out_dtype=
    use_fast_accum=True,
)
print("result:", result[0][0])
<Bug 1 Printed result>
result: tensor(5.5000000000, device='cuda:0', dtype=torch.float8_e4m3fn)
<Bug 2 Code details>
import torch

# Same as bug 1, but with 2**4 extra elements along the reduction dimension.
a = torch.tensor([1.0 * 2**(-9)] * 0 + [1.0 * 2**(-9)] * (2**21 + 2**4), dtype=torch.float32, device='cuda')
b = torch.tensor([[1.0 * 2**(-9)] * 0 + [1.0 * 2**(-9)] * (2**21 + 2**4)] * 16, dtype=torch.float32, device='cuda')
a = a.to(torch.float8_e4m3fn)
a = a.unsqueeze(0)  # shape (1, 2**21 + 2**4)
b = b.to(torch.float8_e4m3fn)  # shape (16, 2**21 + 2**4)
result = torch._scaled_mm(
    input=a,
    mat2=b.t(),  # shape (2**21 + 2**4, 16)
    scale_a=torch.tensor([1.0], dtype=torch.float32, device='cuda'),
    scale_b=torch.tensor([1.0], dtype=torch.float32, device='cuda'),
    # bias=
    # scale_result=
    # out_dtype=
    use_fast_accum=True,
)
print("result:", result[0][0])
<Bug 2 Printed result>
result: tensor(4., device='cuda:0', dtype=torch.float8_e4m3fn)
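
For reference, this is how I arrive at 8.0 as the expected value in both cases. It is only a minimal sketch in plain Python that uses exact rational arithmetic instead of the GPU, so it says nothing about how the hardware accumulates; expected_dot is a helper name made up for this post, not a PyTorch function.

```python
from fractions import Fraction


def expected_dot(k: int, x: Fraction) -> Fraction:
    """Exact dot product of two length-k vectors whose elements all equal x."""
    return k * x * x


# 2**(-9) is the smallest positive subnormal of float8_e4m3fn,
# so the conversion from float32 does not change the inputs.
x = Fraction(1, 2**9)

bug1 = expected_dot(2**21, x)           # 2**21 * 2**(-18)
bug2 = expected_dot(2**21 + 2**4, x)    # 8 + 2**(-14)

print(bug1)         # 8
print(float(bug2))  # 8.00006103515625
```

The representable neighbours of 8.0 in float8_e4m3fn are 7.5 and 9.0, so both exact sums should round to 8.0 in the output dtype.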