Accessing "real" connection status with EDID-cached outputs using NVWMI

Hi,

Not sure if this is the right category but here goes.

We have recently started building systems with multiple A4000 in them (2 or 3) and have been running test regimes for long-running system stability.

The general config of the system is:

  • Win10 x64 (LTSC 1809)
  • Some multi-output mosaics along with standalone outputs, but no mosaics across boards (no sync board)
  • Outputs are all EDID cached.
  • Latest feature-branch driver tested (496.49), as well as production branch (472.12)
  • Physical screens on 4Kp30 native DP plus one or two at 1080p30 on HDMI via active adapter

One problem that we have picked up is that sometimes when powering the physical screens off and on again, we occasionally get a “no signal” output from the GPU, even though the EDID cache is on. This might happen to an isolated screen; on one occasion it was all of them. We can recover the situation either by repeatedly switching the monitor off and on, or by physically replugging the cable.

We have seen this happen with both Asus and LG screens so it’s not a screen problem.

This seems like a low-level problem with the way the GPU handles a state change of the “Display Active” signal from the monitor.

We had hoped that NVWMI might give us a clue as to what is happening here, but it shows all monitors as “active” all the time, presumably because of the EDID cache.

Could you please assist with how we may get to the bottom of this? Could it be a VBIOS issue? (Not even sure where to find VBIOS updates for these boards.)

Is there any way in NVWMI to “see” the true status of the connection, in terms of whether the GPU is outputting a signal or not?
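For context, this is roughly how we have been polling NVWMI so far (a sketch using the third-party `wmi` Python package; the `root\CIMV2\NV` namespace and the `DesktopMonitor` class and attribute names are our reading of the NVWMI docs and may differ on other driver versions — the transition-logging helper is just ours):

```python
# Sketch: poll NVWMI for per-monitor state and log transitions.
# ASSUMPTIONS: the "wmi" package (pip install wmi); the root\CIMV2\NV
# namespace and DesktopMonitor class/attribute names are taken from the
# NVWMI documentation and may need adjusting for your driver version.
import time

def query_monitor_states():
    """Return {monitor_id: state} from NVWMI (Windows only)."""
    import wmi  # third-party; imported here so the helpers below stay portable
    nv = wmi.WMI(namespace=r"root\CIMV2\NV")
    states = {}
    for mon in nv.DesktopMonitor():  # class/attribute names assumed
        states[mon.id] = mon.connectionState
    return states

def diff_states(prev, curr):
    """Pure helper: list of (monitor_id, old_state, new_state) for changes."""
    changes = []
    for mon_id, new_state in curr.items():
        old_state = prev.get(mon_id)
        if old_state != new_state:
            changes.append((mon_id, old_state, new_state))
    return changes

def watch(poll_seconds=5):
    """Print a timestamped line every time a monitor's state changes."""
    prev = {}
    while True:
        curr = query_monitor_states()
        for mon_id, old, new in diff_states(prev, curr):
            print(f"{time.strftime('%H:%M:%S')} monitor {mon_id}: {old} -> {new}")
        prev = curr
        time.sleep(poll_seconds)
```

As noted, though, the reported state never changes for us — which is exactly the problem.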

Thanks
mctozz

Hello @mctozz and welcome to the NVIDIA developer forums!

Could you maybe share your display topology in a bit more detail, possibly showing the system topology from the NVIDIA control panel so we get a better idea of your setup?
Also, what do you mean by EDID cache? Is it that you upload your EDID information through the control panel?
You mention you are using heterogeneous display connections, using native DP and what I assume are DP-to-HDMI adapters. Is there a pattern to the signal loss? For example, if it is just one screen, is it always a native DP connection or always an adapter connection?
And when you power on the screens, do you power them on in order or all at once by a central switch?

I will also check with internal engineering whether there are known issues that might fit your description.

Thanks!

Hi MarkusHoHo - have sent you topology screencap by DM.

You will see that on one of the GPU boards we have a 4 x 4Kp30 mosaic giving a 15360x2160 logical monitor.

By EDID cache I meant exporting the EDID files and loading them against their respective outputs within the NV control panel (i.e. so Windows thinks the monitors are connected whether they physically are or not). However, we don’t actually have a physical monitor connected to every output, just some of them.
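Incidentally, before loading the exported EDID binaries we run a quick sanity check on them. A minimal sketch, using only the standard EDID 1.x structure (fixed 8-byte header, and each 128-byte block summing to zero mod 256); nothing here is NVIDIA-specific:

```python
# Sketch: sanity-check an EDID binary exported from the NVIDIA control panel.
# Relies only on the standard EDID 1.x layout: a fixed 8-byte header and a
# trailing checksum byte that makes each 128-byte block sum to 0 mod 256.
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])

def validate_edid(data: bytes) -> bool:
    """True if data looks like a well-formed EDID (base block + extensions)."""
    if len(data) < 128 or len(data) % 128 != 0:
        return False
    if data[:8] != EDID_HEADER:
        return False
    # Every 128-byte block must checksum to 0 mod 256.
    for i in range(0, len(data), 128):
        if sum(data[i:i + 128]) % 256 != 0:
            return False
    return True
```

That at least rules out a corrupted export as the cause of odd behaviour on a given output.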

Only one physical output has a DP-to-HDMI adapter on it, but the problem has been mostly observed on the direct native DP connections. We can reproduce the issue by soft power-cycling just one of the monitors. (And it’s not necessarily any particular one.)

We also have a second problem that is not easily reproducible; it takes some number of days of continuous running. We get a graphics-system deadlock of some type: everything GUI-related is frozen, including the mouse pointer, but the O/S is still very much alive underneath. This problem manifests spontaneously and does not coincide with monitor power-cycling activity. The last time this happened, disconnecting one of the monitors’ DP cables suddenly released the deadlock and everything carried on.

If it wasn’t for the common involvement of the low-level DP signal presence, I would have said these were unrelated issues. But let’s focus on the first problem, because we can reproduce it reasonably easily, and that might lead to discoveries relating to the second. If Engineering has any other reports like this one, please let us know though. Since the deadlock last happened we have switched to the feature-branch version and are waiting to see if it recurs. (The first issue happens with both the production and feature branches.)

Cheers, mctozz

Thank you for the detailed information!

I will forward this to the experts, since it is outside my personal expertise; hopefully they will have some suggestions on how to address your issues.

Hi again,

I had an exchange with our experts and received a few clarifications and suggestions that might help you.

First of all, regarding the cached EDID information: this will NOT fake a permanent connection with DP monitors. It only swaps the real EDID from a sink for an EDID from file whenever an EDID is requested, for example during a modeswitch. For DisplayPort specifically, there needs to be link training between sender and receiver on every modeswitch to negotiate configuration and bandwidth.

If a connected monitor is power-cycled, low-level link training might or might not occur, and proper reconnection can fail depending on the situation. To verify this you could hot-unplug and replug the monitor while there is no signal.

So the best recommendation we can give is to NOT turn off the screens, but always (re)boot the system with the screens already on. In the case of DP, the cached EDID does not help to guarantee a connection and correct bandwidth.
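If powering the screens off cannot be avoided entirely, one user-side mitigation is to retry whatever detection check you use with a backoff after power-on, rather than relying on a single link-training attempt. A sketch (here `probe` is a placeholder for your own check, e.g. an NVWMI query — it is an assumption, not an NVIDIA API):

```python
# Sketch: retry a display-detection probe with exponential backoff after a
# monitor power-on. "probe" is a placeholder for your own check (e.g. an
# NVWMI query) and should return True once the GPU reports a signal.
import time

def backoff_delays(initial=1.0, factor=2.0, attempts=5):
    """Pure helper: the sequence of delays between detection attempts."""
    return [initial * factor ** i for i in range(attempts)]

def wait_for_signal(probe, initial=1.0, factor=2.0, attempts=5):
    """Probe repeatedly, sleeping between attempts; True if a signal appears."""
    for delay in backoff_delays(initial, factor, attempts):
        if probe():
            return True
        time.sleep(delay)
    return probe()
```

This does not fix the underlying link-training failure, but it distinguishes a slow reconnection from a permanently lost one.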

I also got some feedback regarding the other issue you mentioned. There is precedent for bad connections (a bad cable or similar) triggering constant re-training of the DP link and thereby causing an OS deadlock. To narrow down that issue, you should test with shorter cables and avoid adapters.

Cable issues can of course also have an influence on general signal/bandwidth reliability.

I hope this helps you to troubleshoot your issues.

Thanks for the follow through.

On the first issue, for the avoidance of doubt we’re going to source some different, high-bandwidth DP cables to test with, but I personally don’t think this is the problem in our testing environment.

Could you clarify: is it the Windows driver that performs the low-level DP link negotiation, or is it the firmware on the board? I know that there were firmware updates for GeForce relating to DP 1.3 and DP 1.4 compatibility, which implies it’s a firmware thing. (Where does one even check for firmware updates for RTX Ampere boards?)

With the second issue, I can see how constant link negotiation might hurt the responsiveness of the Windows UI if we weren’t using file-based EDIDs to defeat the PnP mechanism (and I have seen before what that issue looks like when it happens). But this UI deadlock is quite different: all UI and the mouse pointer are solidly frozen. It looks for all the world like the O/S has hung, except that network processes are still responsive, and we could release the deadlock as soon as we pulled a DP cable. I have never, ever seen this kind of thing before. I am also not sure why the Windows timeout detection mechanism (TDR) is not coming into play here and resetting the drivers.

In any case, the other fact that suggests this “constant retraining” is not the cause is that the deadlock has always happened when the monitors were off (i.e. in standby) and had been so for some hours, so there would not have been anything actually going on to trigger link renegotiation.

I’m not yet prepared to concede that there aren’t bug(s) in the driver or firmware relating to both of these issues. It would be enormously helpful if we could see the low-level logging that exists somewhere (or could be enabled) which records events relating to physical link negotiations. Could you please ask the experts how this would be done? Then we can send you back real evidence of whether there is a bug that needs further investigation.

I realise this is getting into the realms of something that belongs in a formal product support conversation, but we’re still very unclear on how we actually do that for this NVIDIA product segment.

Thanks again.

Thank you for the detailed response!

This should happen as part of the driver, not the firmware.

Firmware updates are very rare for end users and only issued for critical issues. The DP-related ones you mentioned were, for example, for Maxwell- and Pascal-based cards. They are announced through the usual channels like our blog, or can be found in our knowledge base.

Of course! Software is never perfect, and all our internal testing can only verify a finite number of setups. All the more important, then, to be able to rule out other possible causes like faulty cables.

I will see what I can do to find ways for you to look into display events beyond the Windows built-in Event Viewer, which I would recommend as a first step. But I can’t promise anything, since this usually involves debug versions of our drivers which, as you will understand, we don’t give out to end users. If I receive information from the NVWMI team on whether it could be used to help here, I will of course share it.
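As a first pass, you could for example export the System log from Event Viewer to CSV and filter for display-related sources. A sketch (assumptions: `nvlddmkm` is the usual source name for the NVIDIA WDDM kernel driver, and the CSV columns match Event Viewer’s default “Save As CSV” layout):

```python
# Sketch: filter an Event Viewer CSV export (System log) for display-related
# entries. ASSUMPTION: the default "Save As CSV" column layout of
# Level,Date and Time,Source,Event ID,Task Category,Description.
import csv
import io

# Typical sources for NVIDIA driver and display hotplug events (assumed).
DISPLAY_SOURCES = {"nvlddmkm", "Display"}

def display_events(csv_text):
    """Yield (timestamp, source, event_id, description) for display sources."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row.get("Source") in DISPLAY_SOURCES:
            yield (row["Date and Time"], row["Source"],
                   row["Event ID"], row["Description"])
```

Correlating the timestamps of such entries with your monitor power-cycles would already be useful evidence.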

Regarding Enterprise support: you should have first-level support through the third party you acquired your cards from. They should have ways of forwarding requests to NVIDIA. Beyond that, you will need to go through the Enterprise Support portal.