Is VBIOS version significant to report for reproducibility?

I’m trying to be as specific as possible in my report so that the model I recently developed can be reproduced.

Across machines with different GPU models, the training results differ, which is understandable. But while the results are consistent on all our RTX 4090 machines, our RTX 3090 machines produce two different results.

After digging deeper, it turns out our RTX 4090 machines carry one of two VBIOS versions: 95.02.3C.40.95 or 95.02.18.C0.75. The results are the same on both of them, however.

On our RTX 3090 machines, several VBIOS versions are in use, falling into two families: 94.02.42.??.?? and 94.02.26.??.??. The machines with 94.02.42.??.?? all produce one result, and the machines with 94.02.26.??.?? all produce another, different result.

I didn’t find any differences in the clock frequencies of the RTX 3090 cards, though.

So my question is: can different VBIOS versions of the same GPU model lead to different training results? This is not the case on our RTX 4090 machines, but it might be the source of the inconsistency on our RTX 3090 machines.

The different VBIOS versions may well be a red herring. I would consider that likely in this context.

Generally speaking, there are interactions between the VBIOS and the NVIDIA driver stack. While I cannot think of anything in those interactions that would affect the computational behavior of CUDA applications in this regard, I also cannot categorically exclude the possibility. That makes the VBIOS version potentially significant for reproducibility, so why not simply add a few bytes to the report to record it.
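For what it’s worth, the VBIOS version can be recorded programmatically rather than read off by hand. A minimal sketch using the pynvml bindings (assuming that package is available; `nvidia-smi --query-gpu=vbios_version --format=csv` reports the same information) might look like this:

```python
# Minimal sketch (assumes the pynvml package is installed): record the VBIOS
# version, driver version, and maximum clocks of every GPU so they can be
# pasted into a reproducibility report.
import pynvml

pynvml.nvmlInit()
print("Driver:", pynvml.nvmlSystemGetDriverVersion())
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    vbios = pynvml.nvmlDeviceGetVbiosVersion(handle)
    max_sm = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
    max_mem = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM)
    print(f"GPU {i}: {name}, VBIOS {vbios}, max SM {max_sm} MHz, max mem {max_mem} MHz")
pynvml.nvmlShutdown()
```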

I am assuming the RTX 3090 GPUs mentioned all run the same NVIDIA software stack: same driver version, same CUDA version. If so, and if this were my setup, my working hypothesis would be that differences in the host systems, in particular differences in their software stacks, are the cause of the differences observed in the model. I would also consider the possibility that the model software contains a latent bug that is usually masked but is sometimes exposed depending on timing, for example a race condition of some kind.
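One cheap way to probe the latent-bug hypothesis is to repeat the same training run twice on one host with identical inputs and compare the artifacts it writes. A minimal, framework-agnostic sketch (the output paths are hypothetical placeholders for whatever your training job produces) could be:

```python
# Minimal sketch: compare artifacts from two runs on the same host to check
# whether a single machine reproduces its own result. The file paths below
# are hypothetical placeholders for whatever the training job writes.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

run_a = Path("run_a/final_checkpoint.bin")  # hypothetical output of run 1
run_b = Path("run_b/final_checkpoint.bin")  # hypothetical output of run 2

if file_digest(run_a) == file_digest(run_b):
    print("Runs match bit-for-bit on this host.")
else:
    print("Runs differ on the same host, which points at nondeterminism in the job itself.")
```

If a single machine cannot reproduce its own result, the VBIOS question becomes moot and the investigation shifts to the training code and its dependencies.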

An easy experiment in these kinds of situations is to cyclically swap the GPU(s) between the host systems to see whether the issue follows the GPU(s) or stays with the host systems.