Nvidia-smi

Hi! What does the “Replays Since Reset” parameter in the “nvidia-smi” utility show?
Can you explain in more detail?

It’s in the PCI section of the output. It refers to PCIe replays. If you google that, you’ll find descriptive info such as this document. Excerpting from there:

The function that the Replay Buffer provides can be described without going into
too much detail about how the ACK/NAK protocol works. Before a TLP is
transmitted on the line, it is placed in the Replay Buffer and remains there until
that TLP has been positively acknowledged by the receiving device on the other
end of the link. Once an acknowledge packet arrives from the receiver which
acknowledges that TLP, it can be removed from the Replay Buffer providing
additional room for new TLPs. However, if no acknowledge packet was received
or the acknowledge packet that was received indicated a negative acknowledge,
then that TLP and any TLPs transmitted after it must be retransmitted or
“replayed” out of the Replay Buffer. Even after being retransmitted, those TLPs
must remain in the Replay Buffer until the receiver positively acknowledges it
received them in order and without error.
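
To make the excerpt above a bit more concrete, here is a toy Python model of the transmitter side. It is only a sketch of the idea, not the real PCIe data link layer: the class name `ReplayBuffer`, its methods, and the `replays` counter are made up for illustration, acknowledgements are assumed to be cumulative, and sequence-number wraparound and replay timers are ignored.

```python
from collections import deque


class ReplayBuffer:
    """Toy model of a transmitter-side replay buffer (hypothetical, simplified)."""

    def __init__(self):
        self.pending = deque()  # (seq, payload) pairs sent but not yet acknowledged
        self.next_seq = 0       # sequence number for the next TLP
        self.replays = 0        # number of replay events so far

    def send(self, payload):
        """Keep a copy of the TLP in the buffer before it goes on the line."""
        seq = self.next_seq
        self.pending.append((seq, payload))
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Positive acknowledge: free every TLP up to and including seq."""
        while self.pending and self.pending[0][0] <= seq:
            self.pending.popleft()

    def nak(self, last_good_seq):
        """Negative acknowledge: free what arrived, replay everything after it."""
        self.ack(last_good_seq)
        self.replays += 1
        # The replayed TLPs stay in the buffer until they are acknowledged.
        return list(self.pending)


buf = ReplayBuffer()
for payload in ["a", "b", "c"]:
    buf.send(payload)

resent = buf.nak(0)  # TLP 0 arrived, TLPs 1 and 2 must be replayed in order
buf.ack(2)           # receiver now confirms everything up to TLP 2
```

In this toy model, `replays` plays roughly the role of the counter that nvidia-smi reports: it only goes up when something had to be retransmitted.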

I won’t be able to go into a tutorial on PCIe here. I may not be able to answer follow-up questions, but if you want to learn how PCIe works, google will likely provide useful info/results.

So… if I get more than zero replays, is that bad or not?

I am not a PCIe expert, but my understanding is that PCIe replay is a robustness-enhancing mechanism that allows hardware to automatically fix certain transmission errors, at some small overhead in performance. In this sense one might compare it to ECC for DRAM.

Just like an occasional single-bit error corrected by ECC is nothing to worry about, the same should apply to PCIe replays. The question then, of course, becomes what error rate is generally considered acceptable. I don’t have a grasp on that. What rate of increase are you seeing for this counter per day?
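
One way to measure that rate, as a rough sketch: read the counter from the text output of `nvidia-smi -q` at two points in time and scale the difference to a day. The helper names below (`parse_replays`, `replays_per_day`, `read_replays`) are my own, and the regular expression assumes the counter appears as a line of the form `Replays Since Reset : <number>`, once per GPU.

```python
import re
import subprocess


def parse_replays(smi_text):
    """Extract every 'Replays Since Reset' value from `nvidia-smi -q` text."""
    return [int(m) for m in re.findall(r"Replays Since Reset\s*:\s*(\d+)", smi_text)]


def replays_per_day(count_start, count_end, elapsed_seconds):
    """Scale the counter increase over elapsed_seconds to a per-day rate."""
    return (count_end - count_start) * 86400.0 / elapsed_seconds


def read_replays():
    """Run nvidia-smi and return one counter value per GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return parse_replays(result.stdout)
```

For example, call `read_replays()` now and again after a few hours under load, then pass the two values for a GPU and the elapsed seconds to `replays_per_day`. Since the counter is "since reset", a reboot or GPU reset in between would invalidate the comparison.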

The reliability of PCIe transmission can be negatively impacted by (1) use of riser cards, (2) dirty contacts or incorrect seating of a pluggable device in the PCIe slot, (3) a noisy electromagnetic environment, and (4) possibly a challenging mechanical environment (vibration).

From my time working with DOCSIS cable technology, typical sources of electromagnetic noise in a household environment are vacuum cleaners and electric drills, or more generally electric motors.

Poor layout or fabrication quality of motherboards, etc., can of course also contribute to a reduction in PCIe transmission quality, and I would expect that aspect to gain importance when transitioning to PCIe versions with higher transmission rates: PCIe gen3 → PCIe gen4 → PCIe gen5. Are you operating your GPUs with a PCIe gen3 or PCIe gen4 configuration?
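
If you are not sure which generation the link is running at, nvidia-smi can report it. The sketch below assumes the query fields `pcie.link.gen.current` and `pcie.link.gen.max` are available in your driver version (check `nvidia-smi --help-query-gpu`); the function names are made up for illustration.

```python
import subprocess


def parse_link_gen(csv_text):
    """Parse 'current, max' lines (one per GPU) into a list of integer pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        current, maximum = (field.strip() for field in line.split(","))
        pairs.append((int(current), int(maximum)))
    return pairs


def read_link_gen():
    """Query nvidia-smi for the PCIe link generation of each GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=pcie.link.gen.current,pcie.link.gen.max",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_link_gen(result.stdout)
```

As far as I know, the current link generation can drop while the GPU is idle due to power management, so it is best to read it while the GPU is busy.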