AGX GPU is sometimes not detected

We have an issue where our AGX GPU is not always detected on boot.

The debug (micro USB) logs show the following just before the login prompt:

[    8.387014] random: crng init done
[    8.387130] random: 7 urandom warning(s) missed due to ratelimiting
[    8.454196] using random self ethernet address
[    8.454354] using random host ethernet address
[   10.981103] Bridge firewalling registered

Ubuntu 18.04.6 LTS fenchurch ttyTCU0

fenchurch login: [   14.981416] pva 16000000.pva0: failed to get free Queue
[   14.981555] pva 16000000.pva0: failed to get free Queue
[   14.982745] pva 16000000.pva0: failed to get free Queue
[   14.982890] pva 16000000.pva0: failed to get free Queue
[   14.984887] pva 16000000.pva0: failed to get free Queue
[   14.985265] pva 16000000.pva0: failed to get free Queue
[   15.040015] pva 16800000.pva1: failed to get free Queue
[   15.040160] pva 16800000.pva1: failed to get free Queue
[   15.041177] pva 16800000.pva1: failed to get free Queue
[   15.042375] pva 16800000.pva1: failed to get free Queue
[   15.043357] pva 16800000.pva1: failed to get free Queue
[   15.044903] pva 16800000.pva1: failed to get free Queue

More notes:

  • No HDMI connected (same happens with a screen connected).
  • No USB-C connected (same happens with devices connected).
  • 18V, 10A capable power supply

I have no idea how you concluded the GPU is not detected from just this log, with no HDMI monitor connected…

How did you tell it is not detected? PVA is not the GPU…

You should dump dmesg and also connect HDMI on your board. Please share the software release version with us.
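
On L4T both can be grabbed with something like this (standard locations assumed):

# Capture the kernel log and the L4T release string.
dmesg > dmesg_dump.txt
cat /etc/nv_tegra_release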

Hi @WayneWWW here is some more detail,

Our first encounter with this issue was when nvidia-container-cli started giving the following error:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected: unknown.
ERRO[0000] error waiting for container: context canceled
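
For context, any CUDA-dependent container produces this error on a bad boot; a minimal reproduction is roughly the following (the image and tag are just an example):

# Any container that goes through the nvidia runtime fails the same way on a
# bad boot; l4t-base and the r32.7.1 tag are only an example here.
sudo docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1 /bin/true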

We intermittently saw this issue after the following changes. By intermittent, I mean once every 4-5 boots. A bad shutdown also seems to cause the issue, which can then be resolved with a restart.

  • L4T update to 32.7.2 and L4T_DOCKER to 32.7.1
  • Turning on the PLL for the CAN clock so it runs at exactly 250000 bps, via this device tree change:

+		pll_source = "pllaon";
+		clocks = <&bpmp_clks TEGRA194_CLK_CAN1_CORE>,
+			<&bpmp_clks TEGRA194_CLK_CAN1_HOST>,
+			<&bpmp_clks TEGRA194_CLK_CAN1>,
+			<&bpmp_clks TEGRA194_CLK_PLLAON>;
+		clock-names = "can_core", "can_host", "can", "pllaon";

  • Changing the cboot boot configuration:

/dts-v1/;

/ {
	compatible = "nvidia,cboot-options-v1";
	boot-configuration {
		boot-order = "sd", "usb", "emmc", "nvme", "net";
		tftp-server-ip = /bits/ 8 <192 168 0 1>;
		dhcp-enabled;
	};
};

Update:
Waiting in the debug (micro USB) terminal before typing boot + enter does not seem to have an effect; that might just have been coincidental. The only concrete information I have at this point is that nvidia-docker containers do not start when pva0: failed to get free Queue shows up in the micro-USB debug logs.
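
Based on that, a rough way to tell whether a given boot is affected (just a sketch, nothing official) is to check the kernel log for the queue failure:

# If the PVA queue failure is in the kernel log, nvidia-docker containers
# will not start on this boot (at least in our experience).
if dmesg | grep -q "failed to get free Queue"; then
    echo "bad boot: PVA queue failure present"
else
    echo "no PVA queue failures in dmesg"
fi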

Running nvidia-container-cli info gives:

nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected
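
To rule out Docker itself, a GPU check that skips the container stack helps; a sketch, assuming the CUDA samples from JetPack are installed (the path depends on the installed CUDA version):

# Build and run deviceQuery from the CUDA samples (path is an assumption).
# On a bad boot we would expect the same "no CUDA-capable device" error here.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery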

Here is a dmesg dump, as requested. Monitor connected, display works fine and shows our configured splash screen.
dmesg_dump.txt (77.7 KB)

On a good boot we get:

[    0.853567] iommu: Adding device 16000000.pva0 to group 48
[    0.854228] iommu: Adding device 16800000.pva1 to group 49
[    1.636979] pva 16000000.pva0: initialized
[    1.668024] pva 16800000.pva1: initialized

On a bad boot we get:

[    0.856987] iommu: Adding device 16000000.pva0 to group 48
[    0.857619] iommu: Adding device 16800000.pva1 to group 49
[    1.508835] pva 16000000.pva0: initialized
[    1.540228] pva 16800000.pva1: initialized
[   13.835500] pva 16000000.pva0: failed to get free Queue
[   13.835657] pva 16000000.pva0: failed to get free Queue
[   13.837465] pva 16000000.pva0: failed to get free Queue
[   13.846573] pva 16000000.pva0: failed to get free Queue

Hi,

If you don’t change the device tree and dtbo, do you still hit this issue? Those changes are for the CAN bus and the boot device, right?

Hi @WayneWWW,

Thanks for the quick replies. I was about to revert those changes, but first decided to completely uninstall docker.io. For some reason, I could not disable the service on startup using systemctl.
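
For reference, the commands I would normally use for that are below; in my case disabling alone did not seem to stick (possibly because docker.socket can re-activate the service, but that is a guess):

# Keep Docker from starting at boot; mask is the heavier-handed option if
# disable alone does not stick.
sudo systemctl stop docker
sudo systemctl disable docker
sudo systemctl mask docker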

After this, the AGX boots without the PVA failure. Another good sign is that jetson_clocks works; it did not when the AGX booted with the PVA failure.
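
A quick way to check that is simply running jetson_clocks:

# On a good boot this runs cleanly; on a boot with the PVA failures it did
# not work for us.
sudo jetson_clocks
sudo jetson_clocks --show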

So, in conclusion: to fix the issue, we purged the Docker cache with rm -rf /var/lib/docker. It seems that something in the Docker update did not like our old images and caused this PVA issue.
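
For anyone hitting the same thing, the cleanup was roughly the following; note that this deletes all local images, containers and volumes, so everything has to be pulled again afterwards:

# Stop Docker, wipe its data directory, and start it again.
# WARNING: this removes ALL local images, containers and volumes.
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker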

Hopefully, this helps someone else.
