Inconsistency in LLAMA3.1 70B Output

Hi everyone, we are noticing different outputs from NIM LLAMA3.1 70B version 1.1.1 for the same input. With temperature 0.0 and seed 42, out of 4 attempts the 3rd and 4th differed from the first two. On searching further, I found that when in-flight batching is enabled (Non-determinism for identical sequences · Issue #1336 · NVIDIA/TensorRT-LLM · GitHub), results can differ because TensorRT-LLM may select different kernels depending on the current batch size. Is there any way to avoid scenarios like this?
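For reference, a minimal sketch of the kind of request described above, with temperature and seed pinned. The endpoint URL and model identifier here are assumptions for illustration, not taken from the thread; as the discussion below notes, pinning these parameters alone does not guarantee determinism when in-flight batching is active.

```python
import json

# Hypothetical NIM OpenAI-compatible endpoint and model name (assumptions,
# not confirmed by the thread).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta/llama-3.1-70b-instruct"


def build_request(prompt: str) -> dict:
    """Build a chat-completions body pinned for best-effort reproducibility."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
        "seed": 42,          # fixed seed, as in the experiment above
        "top_p": 1.0,
    }


if __name__ == "__main__":
    # Print the body that would be POSTed to BASE_URL.
    print(json.dumps(build_request("Hello"), indent=2))
```

Even with this body sent twice, the server may batch the two requests differently, which is the source of the divergence observed above.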

At this time, it looks like there’s no plan to support deterministic outputs with batch size > 1.

Hi @calexiuk, thanks for clarifying. Is there a way to force NIM to use a batch size of one, so we can avoid non-deterministic output? We cannot have multiple customers receiving different outputs for the same input.
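One partial, client-side workaround sketch: serialize requests so this client has at most one in flight at a time. Note this does not force the server to a batch size of one; requests from other clients can still be batched together with ours, so it only reduces the chance of divergence rather than eliminating it.

```python
import threading

# Guard so this process issues one request at a time (client-side only;
# the server may still batch our request with other clients' traffic).
_lock = threading.Lock()


def serialized_call(fn, *args, **kwargs):
    """Run the given request function under a lock, one call at a time."""
    with _lock:
        return fn(*args, **kwargs)
```

Usage would be wrapping the actual HTTP call, e.g. `serialized_call(post_chat_completion, body)`, where `post_chat_completion` is whatever client function you already use.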

@rahulsingh55 there’s no option to modify the batch size at runtime right now, but we appreciate the feedback.