I’m working at a small company that uses NVIDIA GPUs for inference, and we’ve run into some strange behavior during batch processing.
The results from batched inference differ from the non-batched ones, and the batched results also change whenever the number of tasks in the batch changes.
For example, see the code below.
import torch

conv = torch.nn.Conv1d(40, 400, kernel_size=13, stride=1, padding=170, bias=True).cuda()

with torch.cuda.amp.autocast(enabled=True):  # enables automatic fp16 casting
    test_input = torch.rand(1, 40, 2536).cuda()
    test_output = conv(test_input)
    sample1 = test_output[0]

    test_input = torch.cat([test_input, test_input], 0)  # same sample, batched twice
    test_output = conv(test_input)
    sample2 = test_output[0]  # test_output[0] and test_output[1] are equal

diff = torch.abs(sample1 - sample2).sum()
print(diff)  # Expected output is zero.
We’re using the same test_input for the non-batched and batched runs, so we expected diff to be zero, but it isn’t. This only happens when we use mixed precision (autocast enabled) on an NVIDIA T4 (Turing).
Also, sample2 changes whenever the number of tasks in the batch changes, although results within the same batch are all identical (test_output[0] == test_output[1] == …).
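For reference, below is a sketch of the kind of check we run to see this batch-size dependence (the batch sizes are arbitrary, chosen just for illustration):

import torch

conv = torch.nn.Conv1d(40, 400, kernel_size=13, stride=1, padding=170, bias=True).cuda()
base_input = torch.rand(1, 40, 2536).cuda()

outputs = {}
with torch.cuda.amp.autocast(enabled=True):
    for batch_size in (1, 2, 4, 8):  # arbitrary batch sizes, for illustration only
        batched = base_input.repeat(batch_size, 1, 1)  # the same sample repeated batch_size times
        outputs[batch_size] = conv(batched)[0].float()  # keep row 0 of each run

# Rows within one batch match, but row 0 drifts as the batch size changes.
for batch_size, out0 in outputs.items():
    print(batch_size, (out0 - outputs[1]).abs().sum().item())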
We’re trying to set up a metric and debugging process, and that requires reproducibility in production code (which also needs high performance). I’d like to know what I’m missing, and how I can work around this while keeping fp16 throughput.
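For completeness, here is a sketch of the global determinism knobs we’re aware of in PyTorch; we don’t know whether they address this particular autocast/Turing behavior, and the deterministic cuDNN mode may cost throughput:

import torch

# Sketch only; we haven't confirmed these remove the batch-size dependence
# under autocast on the T4, and some of them trade away speed.
torch.backends.cudnn.benchmark = False      # disable cuDNN autotuning of conv algorithms
torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic kernels
# torch.use_deterministic_algorithms(True)  # stricter: errors out on known non-deterministic ops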
Thank you in advance!!