Boot Time Problem: nv_virtual_shutdown Service Fails

Software Version
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its number)
other

SDK Manager Version
1.9.3.10904
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other


Summary:

Our belief is that the NVIDIA Linux systemd service's dependency on the required kernel modules being loaded isn't working.

There is a device node the service needs to wait for: it exists when we check at runtime, but it is apparently not yet present during boot.

The file is /dev/tegra_hv_pm_ctl
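As a first diagnostic (our suggestion, not part of the original report), you can inspect what ordering dependencies the unit currently declares:

tegra-ubuntu:~$ systemctl cat nv_virtual_shutdown.service
tegra-ubuntu:~$ systemctl list-dependencies --after nv_virtual_shutdown.service

If neither output mentions the device or the module that creates it, systemd is free to start the service before /dev/tegra_hv_pm_ctl exists.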

We are seeing an error that manifests as:

tegra-ubuntu:~$ systemctl status nv_virtual_shutdown
● nv_virtual_shutdown.service - Hypervisor initiated Shutdown Service
     Loaded: loaded (/lib/systemd/system/nv_virtual_shutdown.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-02-20 18:47:01 UTC; 23min ago
   Main PID: 1001 (code=exited, status=255/EXCEPTION)

Feb 20 18:47:01 tegra-ubuntu systemd[1]: Started Hypervisor initiated Shutdown Service.
Feb 20 18:47:01 tegra-ubuntu bash[1013]: chmod: cannot access '/dev/tegra_hv_pm_ctl': No such file or directory
Feb 20 18:47:01 tegra-ubuntu bash[1019]: hv_pm_ctl_init: Failed to open /dev/tegra_hv_pm_ctl, -2
Feb 20 18:47:01 tegra-ubuntu systemd[1]: nv_virtual_shutdown.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 20 18:47:01 tegra-ubuntu systemd[1]: nv_virtual_shutdown.service: Failed with result 'exit-code'.

However, we can see that file actually does exist:

tegra-ubuntu:~$ ls -l /dev/tegra_hv_pm_ctl
crw-------. 1 root root 476, 0 Feb 20 18:47 /dev/tegra_hv_pm_ctl

Checking journalctl, we can see:

tegra-ubuntu:~$ journalctl -b-1 -u nv_virtual_shutdown
-- Logs begin at Tue 2024-02-13 19:28:37 UTC, end at Tue 2024-02-20 19:21:59 UTC. --
Feb 20 18:47:01 tegra-ubuntu systemd[1]: Started Hypervisor initiated Shutdown Service.
Feb 20 18:47:01 tegra-ubuntu bash[1013]: chmod: cannot access '/dev/tegra_hv_pm_ctl': No such file or directory
Feb 20 18:47:01 tegra-ubuntu bash[1019]: hv_pm_ctl_init: Failed to open /dev/tegra_hv_pm_ctl, -2
Feb 20 18:47:01 tegra-ubuntu systemd[1]: nv_virtual_shutdown.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 20 18:47:01 tegra-ubuntu systemd[1]: nv_virtual_shutdown.service: Failed with result 'exit-code'.
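One way to confirm the timing (a diagnostic sketch of our own; USEC_INITIALIZED is standard udev metadata, though we have not verified it for this driver) is to compare when udev initialized the node against when the service started, on the same monotonic clock:

tegra-ubuntu:~$ udevadm info --name=/dev/tegra_hv_pm_ctl | grep USEC_INITIALIZED
tegra-ubuntu:~$ journalctl -b -u nv_virtual_shutdown -o short-monotonic

If the service's start timestamp is earlier than the device's initialization time, the race is confirmed.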

Request

Is there a known workaround for this problem, or a known fix?

Should there be a different dependency requirement in the .service file?

Or maybe the service should have an automatic restart on failure as well?
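To make those two ideas concrete, here is a hedged sketch of a drop-in created with sudo systemctl edit nv_virtual_shutdown.service (our own suggestion, not an NVIDIA-provided fix; the dev-tegra_hv_pm_ctl.device dependency is only effective if udev tags the device with TAG+="systemd", which we have not verified for this driver):

[Unit]
# order after the device unit for the node, if udev creates one
After=dev-tegra_hv_pm_ctl.device
Wants=dev-tegra_hv_pm_ctl.device

[Service]
# retry instead of failing permanently when the node appears late
Restart=on-failure
RestartSec=2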

I didn’t encounter the issue on my system. According to my system status, the service has been active and running without any problems for the past five days.

Active: active (running) since Fri 2024-02-16 19:01:48 UTC; 5 days ago

Have you tried power cycling or reflashing your system to see if that resolves the issue?

Thanks for the quick response.

Yes, power-cycling the system has generally cleared the issue, though we don't yet have a large enough sample size to say it works 100% reliably.

That said, power-cycling is not a workable fix for our use case in practice.

We need to stop the problem from occurring entirely rather than relying on ad-hoc power cycles, mainly because the failures confuse the users of our system and are disruptive.

Any ideas about a deeper fix here?

A further important point is that this happens occasionally and non-deterministically.

It doesn't happen on every boot cycle, but seemingly at random.

It appears to be a timing issue.

Could you please provide detailed steps or specific conditions that reliably reproduce this issue? Understanding the exact circumstances in which it occurs will help us investigate and identify a more permanent solution. Thank you.

So the problem is a race condition between the appearance of the /dev/tegra_hv_pm_ctl device file, which Linux uses to communicate with the hypervisor, and the start of nv_virtual_shutdown.service.

The solution is to add ConditionPathExists=/dev/tegra_hv_pm_ctl to the [Unit] section of the nv_virtual_shutdown.service file, and to create a new file called nv_virtual_shutdown.path with the following content:

[Path]
PathChanged=/dev/tegra_hv_pm_ctl
Unit=nv_virtual_shutdown.service

[Install]
WantedBy=multi-user.target

Then enable this path file using sudo systemctl enable nv_virtual_shutdown.path
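Putting the whole change together, a sketch of the install steps (assuming the stock unit location /lib/systemd/system/nv_virtual_shutdown.service shown in the status output above; a drop-in keeps the ConditionPathExists= change from being overwritten by package updates):

# 1) Add the condition via a drop-in
sudo systemctl edit nv_virtual_shutdown.service
# in the editor, add:
#   [Unit]
#   ConditionPathExists=/dev/tegra_hv_pm_ctl

# 2) Create the path unit
sudo tee /etc/systemd/system/nv_virtual_shutdown.path > /dev/null <<'EOF'
[Path]
PathChanged=/dev/tegra_hv_pm_ctl
Unit=nv_virtual_shutdown.service

[Install]
WantedBy=multi-user.target
EOF

# 3) Reload units and enable the path watcher
sudo systemctl daemon-reload
sudo systemctl enable nv_virtual_shutdown.path

With both pieces in place, a boot where the device node is late no longer fails the service: the condition turns the early start into a no-op, and the path unit starts the service once /dev/tegra_hv_pm_ctl appears.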


Thank you for providing the solution. Could you confirm if you encountered this issue on a system running the default DRIVE OS 6.0.8.1 configuration?

Yes, I have personally experienced this problem on a DRIVE AGX system running 6.0.8.1.

Do you mean without the change NVIDIA previously provided via the forum, which masked the service if it couldn't find this same file? Without that patch installed, reboot won't work at all, because the service won't be running.

Could you provide detailed steps to reproduce the issue by rebooting, e.g., using "sudo reboot"?

So it's a race condition between the driver loading (and that device file appearing) and the nv_virtual_shutdown service starting with the expectation that the file exists. If you reboot your DRIVE AGX units enough times with the patch NVIDIA previously provided (which stops masking the service when that device file doesn't exist), you'll see it.

We reboot the units with the command:

echo 1 | sudo tee /sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot > /dev/null
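For repeated reboot soak testing, a minimal sketch (hypothetical script and log path, run as root after each boot, e.g., from a @reboot cron entry) that stops cycling as soon as the failure is captured:

#!/bin/bash
# reboot-soak.sh: log the service state, then trigger the next hypervisor reboot
sleep 60   # let boot settle before sampling
state=$(systemctl is-active nv_virtual_shutdown)
echo "$(date -Is) nv_virtual_shutdown: ${state}" >> /var/log/nv_shutdown_soak.log
# stop the loop on failure so the broken boot can be inspected
[ "${state}" = "active" ] || exit 0
echo 1 > /sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot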

The required solution will be incorporated in the next release. Thanks.
