"No devices found" error for nvidia-smi command - RTX A2104Ax

Hi,

We have installed the NVIDIA driver NVIDIA-Linux-x86_64-570.169.run for the GPU card RTX NVA2104AxX.

After an AC power cycle, running the nvidia-smi command fails with the error "No devices found".

After a soft reboot (issued through the reboot command), however, nvidia-smi shows the GPU data as expected.

Thanks,
Nagesh R

Can someone reply to this query and help us? Our product release has been blocked on this for a long time.

Thanks.

Hi @nagesh_accord , please take a bug report after the issue occurs and attach it here.
To take the report, run sudo nvidia-bug-report.sh in a terminal; the nvidia-bug-report.log.gz file will be created in the current working directory.
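
For anyone following along, the capture step can be sketched as a small shell helper. This is only a sketch: the function name capture_bug_report and its optional argument are made up for illustration; only nvidia-bug-report.sh and the output file name come from the post above.

```shell
# Hypothetical helper: checks that the report script is on PATH before running it.
capture_bug_report() {
    tool=${1:-nvidia-bug-report.sh}   # the argument exists only to make the sketch testable
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found in PATH; is the NVIDIA driver installed?" >&2
        return 1
    fi
    # The script writes nvidia-bug-report.log.gz into the current working directory.
    sudo "$tool" && ls -lh nvidia-bug-report.log.gz
}
```

Calling capture_bug_report right after the failing AC power cycle should give a log that reflects the "No devices found" state.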

Thanks for the updates.

Where can we download this script?

Hi @nagesh_accord , the script should be present already if you have the NVIDIA driver installed.
Please go through this link; adding these details will help us debug the issue:

Thanks.

Thanks for the updates.

I want to explain the background of the issue and how we resolved it with a workaround. We shipped the unit to the customer site after that.

  1. We were getting the nvidia-smi “No devices found” error only after an AC power cycle, whereas after a normal reboot command it worked fine.
  2. This appeared to be a boot timing / power initialization issue with the NVIDIA GPU: the driver binding failed because the SBC booted much faster than the GPU initialized.
  3. We fixed this by removing the RTC battery from our unit, which slows down booting, so the NVIDIA GPU gets enough time to power up and initialize. The NVIDIA driver binding then succeeded and things started working fine.
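
One way to gather evidence for the timing theory in point 2 is to check, right after a cold boot, whether the GPU is visible on the PCI bus at all and whether the nvidia kernel driver is bound to it. The following is a minimal sketch, assuming pciutils (lspci) is installed; the function name nvidia_pci_state and its three result strings are invented for illustration. 10de is NVIDIA's PCI vendor ID.

```shell
# Hypothetical helper: classifies the output of `lspci -nnk` read from stdin.
nvidia_pci_state() {
    awk '
        # Device lines start with the bus address; detail lines are indented.
        /^[0-9a-f]/ { in_nv = ($0 ~ /\[10de:/); if (in_nv) found = 1 }
        in_nv && /Kernel driver in use: nvidia/ { bound = 1 }
        END {
            if (!found)      print "not-enumerated"       # GPU absent from the PCI scan
            else if (!bound) print "enumerated-unbound"   # GPU present, driver not attached
            else             print "bound"
        }'
}

# Usage on the unit:
#   lspci -nnk | nvidia_pci_state
```

If this prints not-enumerated after an AC power cycle but bound after a soft reboot, that would be consistent with the GPU not being ready when the SBC scans the bus. It does not prove the cause on its own.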

However, since yesterday we have been facing a new issue at the customer site.

It would be great if you could guide us on this.

  1. The customer mistakenly ran the sudo dnf update command, which upgraded the OS from RHEL 9.4 to RHEL 9.6.
  2. Since then, the nvidia-smi command no longer works and they get the error below.

  3. Could you provide a solution for this?

Note: They have only the primary display at the customer site and don't have an SDOUT monitor (the secondary display on the NVIDIA GPU).

Thanks.

Hi @nagesh_accord, I think the error is due to a driver version mismatch. The dnf update command might have installed a newer version of the driver without completely uninstalling the older version, which could lead to this error. I think a complete uninstall/purge of the existing driver followed by a reinstall should fix this issue. It would be helpful if you could get a copy of the bug report from the customer. That would help us check whether it is indeed the same issue or a different one.
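
A quick check that may help confirm or rule out the mismatch theory before uninstalling anything is sketched below. It compares the version of the currently loaded nvidia kernel module with the version of the module installed on disk for the running kernel. The function name driver_version_state and its result strings are hypothetical, and the sketch assumes the module exposes its version under /sys/module/nvidia/version.

```shell
# Hypothetical helper: compares two version strings and names the situation.
driver_version_state() {
    loaded=$1    # version of the loaded module, empty if not loaded
    on_disk=$2   # version of the module on disk for the running kernel, empty if none
    if [ -z "$on_disk" ]; then
        echo "no-module-for-running-kernel"
    elif [ -z "$loaded" ]; then
        echo "module-not-loaded"
    elif [ "$loaded" != "$on_disk" ]; then
        echo "mismatch"
    else
        echo "match"
    fi
}

# Usage on the affected machine (paths are assumptions, see above):
#   driver_version_state "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                        "$(modinfo -F version nvidia 2>/dev/null)"
```

A result of no-module-for-running-kernel would suggest that the module was simply never built for the new RHEL 9.6 kernel, which is a different situation from two driver versions being mixed.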
Thanks.

Could you provide the exact commands for purging/uninstalling the NVIDIA driver NVIDIA-Linux-x86_64-570.169 on RHEL 9.4/9.6?

Please note that the purge/uninstall must not affect the primary display, which is driven by the Intel iGPU on the onboard SBC. It should not delete other drivers, and we must not end up with a blank primary display, as the unit is at the customer site.
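
For reference, a possible uninstall/reinstall sequence for a driver installed from the .run package is sketched below. It is a sketch under stated assumptions, not a verified procedure for this unit: the helper names run and reinstall_nvidia_driver are made up, the kernel-devel step assumes the module has to be rebuilt for the new RHEL 9.6 kernel, and by default the script only prints the commands (DRY_RUN=1) so that they can be reviewed first. nvidia-uninstall is the uninstaller that the .run package installs.

```shell
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; set to 0 to execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reinstall_nvidia_driver() {
    installer=${1:-./NVIDIA-Linux-x86_64-570.169.run}
    # The .run installer builds the kernel module against the running kernel,
    # so headers matching that kernel are needed (assumed missing after the update).
    run sudo dnf install -y "kernel-devel-$(uname -r)"
    # Leave the graphical session so nothing holds the NVIDIA module.
    # Run from an SSH session or text console, not from a GUI terminal.
    run sudo systemctl isolate multi-user.target
    run sudo nvidia-uninstall
    run sudo sh "$installer"
    run sudo systemctl isolate graphical.target
}

# Usage: reinstall_nvidia_driver            (prints the plan)
#        DRY_RUN=0; reinstall_nvidia_driver (executes it)
```

The Intel i915 driver is part of the kernel and is not something the NVIDIA uninstaller ships, but whether the primary display stays unaffected is best verified on an identical lab unit before touching the customer's machine.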

Please find the bug report attached.
nvidia-bug-report.log.gz (400.3 KB)

Apart from the above query, we have a few more questions regarding display resolution issues on the primary monitor, which is driven directly from the SBC, with the NVIDIA driver installed.

  1. We observe that we cannot see the login screen on the primary display (model: AXIOMTEK.CO.,LTD. 19") during booting. It goes to extended mode, onto the output where the secondary monitor from the GPU would be connected.
    After we log in, the primary display works fine again.
    Any idea what causes this behaviour?

  2. Also, if we set the primary display to 1280x1024 in the display settings GUI and reboot the unit, we again cannot see the login screen during booting. Any idea why?

  3. Also, why does the xrandr command always show 4 SDIOUT displays as connected, even though we have not connected any secondary monitors?
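
To make point 3 easier to report, the outputs that X currently considers connected can be listed with a small filter over the xrandr output. This is only a sketch; the function name connected_outputs is made up, and it does not explain why the outputs show as connected, it only lists them.

```shell
# Hypothetical helper: reads `xrandr --query` output on stdin and prints the
# name of every output whose status field is exactly "connected".
connected_outputs() {
    awk '$2 == "connected" { print $1 }'
}

# Usage in the X session on the unit:
#   xrandr --query | connected_outputs
```

Attaching this list together with the full xrandr output taken with no secondary monitor attached would show exactly which outputs are reported.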

Please explain all these unknowns, so that we can clearly understand how this stuff works.

Thanks.

I have attached a few images for reference.