Determinism with mixed precision

I work at a small company that uses NVIDIA GPUs for inference. We've noticed some strange behavior during batch processing.

Results from batch processing differ from the non-batched results, and the batched results change when the number of tasks in the batch changes.

For example, see the code below.

import torch

conv = torch.nn.Conv1d(40, 400, kernel_size=13, stride=1, padding=170, bias=True).cuda()

with torch.cuda.amp.autocast(enabled=True):  # enables fp16 casting for eligible ops
    test_input = torch.rand(1, 40, 2536).cuda()
    test_output = conv(test_input)
    sample1 = test_output[0]

    test_input = torch.cat([test_input, test_input], 0)  # batch the same input twice
    test_output = conv(test_input)
    sample2 = test_output[0]  # test_output[0] and test_output[1] are equal

    diff = torch.abs(sample1 - sample2).sum()
    print(diff)  # expected to be zero

We use the same test_input for the non-batched and batched runs, so we expected diff to be zero, but it isn't. This only happens when we use mixed precision (autocast enabled, on an NVIDIA T4 / Turing).
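
For reference, the same comparison with autocast disabled (a minimal fp32 control, reusing the conv defined above) does not show the discrepancy on our setup:

# fp32 control: same check without autocast; per the observation above, diff is zero here
with torch.cuda.amp.autocast(enabled=False):
    x = torch.rand(1, 40, 2536).cuda()
    single = conv(x)[0]
    batched = conv(torch.cat([x, x], 0))[0]
    print(torch.abs(single - batched).sum())  # zero without mixed precision on this setup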

Also, sample2 changes when the number of tasks in the batch changes, but results within the same batch are all identical (test_output[0] == test_output[1] == …).
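
I'm aware of the usual reproducibility switches (sketch below); as far as I understand, they target run-to-run determinism for a fixed input shape rather than identical results across different batch sizes:

# Standard reproducibility knobs (sketch; they do not, by themselves,
# make results independent of the batch size)
torch.backends.cudnn.benchmark = False       # no autotuned, shape-dependent algorithm search
torch.backends.cudnn.deterministic = True    # restrict cuDNN to deterministic algorithms
# torch.use_deterministic_algorithms(True)   # stricter, may raise on unsupported ops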

We're trying to set up metrics and a debugging process, which requires reproducibility in production code (that also needs high performance). I'd like to know what I'm missing and how I can work around this while keeping fp16 throughput.
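
One direction I'm considering (a sketch only, not verified on the T4): keep autocast for the bulk of the model but locally disable it for the shape-sensitive layer, so that layer runs in fp32 while the rest keeps fp16 throughput; padding every request to a fixed batch size would be another option.

with torch.cuda.amp.autocast(enabled=True):
    x = torch.rand(2, 40, 2536).cuda()
    with torch.cuda.amp.autocast(enabled=False):   # fp32 island for the sensitive conv
        stable_out = conv(x.float())               # .float() in case x was produced in fp16
    # ... rest of the model continues under autocast with fp16 throughput ...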

Thank you in advance!!

Hi @diediealldie,
Apologies for the delay. Are you still facing the issue?