GPU has disappeared from bus

TakenoriSato · May 10, 2021, 7:36am

Hello,

We have seen several cases in the field where Nano boxes rebooted unexpectedly due to a software reset.

terminal emulator log

Then, on one of these devices, we found the cause (or at least one of them), shown below.

As you can see, it says “GPU has disappeared from bus!!”.

There are a couple of similar posts here in the forum, but they were about desktop GPUs, not Jetson. In any case, they were about GPU temperature, so I double-checked the GPU temperature: it was well below the threshold, 34 degrees Celsius in the most recent case (as reported by tegrastats).
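For anyone who wants to log temperatures continuously rather than spot-check them, here is a minimal sketch that pulls the thermal readings out of a tegrastats line. It assumes the readings appear as `NAME@valueC` tokens (for example `GPU@34C`); the sample line and the function names are illustrative, not taken from the affected device.

```python
import re

# Matches thermal readings such as "GPU@34C" or "thermal@34.75C".
# CPU load tokens like "10%@1479" are skipped because "%" is not a word character.
TEMP_RE = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C\b")


def parse_temperatures(line):
    """Return {zone_name: degrees_celsius} for one tegrastats line."""
    return {name: float(value) for name, value in TEMP_RE.findall(line)}


def zones_over(line, limit_c):
    """Return the sorted names of zones whose reading exceeds limit_c."""
    temps = parse_temperatures(line)
    return sorted(name for name, t in temps.items() if t > limit_c)


# Illustrative sample line (assumed format, not a capture from the failing box).
SAMPLE = (
    "RAM 1500/3956MB CPU [10%@1479,5%@1479,off,off] EMC_FREQ 0% GR3D_FREQ 0% "
    "PLL@33C CPU@35.5C GPU@34C AO@41C thermal@34.75C"
)

if __name__ == "__main__":
    print(parse_temperatures(SAMPLE))
    print(zones_over(SAMPLE, 40.0))
```

Feeding each tegrastats line through `zones_over` with a chosen limit makes it easy to confirm after a reboot whether any zone was running hot beforehand.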

L4T is r32.3.1, running on a Nano carrier board, with our own LTE module (SIMCOM/Telit).

I would appreciate any pointers on what we should check, especially from a hardware design point of view.

Thanks!

linuxdev · May 10, 2021, 4:07pm

This is probably not the issue, but if the device tree were wrong, the GPU might disappear. If you’ve made any device tree changes, you might check for a conflict with the GPU.

FYI, you are correct that the desktop GPU posts do not apply: those GPUs are PCI, but the Jetson GPU is integrated directly with the memory controller. Had the GPU been PCI, the device tree would not have been needed to set up some of the GPU (PCI allows the device to be queried, but devices that cannot reply to requests for their details need a device tree).

TakenoriSato · May 11, 2021, 12:03am

Thanks for your comment, @linuxdev

Yes, we use our own pinmux configuration. But do you think the GPU could disappear only under a particular (unknown) condition? I will check our own pinmux spreadsheet.

WayneWWW · May 11, 2021, 2:07am

Please follow this post to share necessary info. Especially share a text log instead of pictures.

linuxdev · May 11, 2021, 4:37pm

You’ll need to add the content mentioned by @WayneWWW, but yes, an error in the device tree can make one device appear while another disappears. It depends on the conflict. Specs are only written for how things work when not in error, so there is no way to define how the system should behave in odd device tree error conditions. As a contrived example: if memory regions are reserved and two devices are not intended to operate in the same memory region, but both do, there is no telling how one will behave as the other puts all of its init into that region (and the device tree might determine this).
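To make the contrived example concrete, here is a minimal sketch of checking a set of memory regions for overlap. The node names and addresses are hypothetical, and the `(base, size)` pairs are assumed to have been extracted from the device tree by hand; this is not a device tree parser.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two (base, size) regions share at least one byte."""
    a_base, a_size = a
    b_base, b_size = b
    return a_base < b_base + b_size and b_base < a_base + a_size


def find_conflicts(regions):
    """regions: {node_name: (base, size)} -> list of conflicting name pairs."""
    return [
        (n1, n2)
        for (n1, r1), (n2, r2) in combinations(sorted(regions.items()), 2)
        if overlaps(r1, r2)
    ]


# Hypothetical regions for illustration only.
REGIONS = {
    "device_a": (0x8000_0000, 0x0010_0000),
    "device_b": (0x8008_0000, 0x0010_0000),  # starts inside device_a
    "device_c": (0x9000_0000, 0x0001_0000),
}

if __name__ == "__main__":
    for pair in find_conflicts(REGIONS):
        print("overlap:", pair)
```

Any pair reported here would be a candidate for the kind of conflict described above, where one device initializes into memory another device is also using.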

TakenoriSato · May 12, 2021, 9:46am

Oops, my bad.

Collected today. 3 files (dmesg, uart, syslog) are attached.

This is when I set uart console.

May 12 09:01:50 nvidia-desktop kernel: [    1.878472] KERNEL: PMC reset status reg: 0x0
May 12 09:01:50 nvidia-desktop kernel: [    1.878548] BL: PMC reset status reg: 0x0
May 12 09:01:50 nvidia-desktop kernel: [    1.878550] BL: PMIC poweroff Event Recorder: 0x50

And this is when the box rebooted after the GPU had gone, and the UART left the message.

May 12 17:37:22 nvidia-desktop kernel: [    0.445992] tegra-pmc: ### PMC reset source: TEGRA_SOFTWARE_RESET
May 12 17:37:22 nvidia-desktop kernel: [    0.445997] tegra-pmc: ### PMC reset level: TEGRA_RESET_LEVEL_WARM
May 12 17:37:22 nvidia-desktop kernel: [    0.446001] tegra-pmc: ### PMC reset status reg: 0x3

gpu_disappeared.zip (711.5 KB)

I think other info was included in my first comment.
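When checking many boxes, the reset reason can be pulled out of syslog automatically. The following is a minimal sketch that extracts the `tegra-pmc` reset fields from lines in the format shown above; the function name is my own, and the pattern only covers the `### PMC reset ...` lines seen in this log.

```python
import re

# Matches e.g. "tegra-pmc: ### PMC reset source: TEGRA_SOFTWARE_RESET".
RESET_RE = re.compile(r"tegra-pmc: ### PMC reset (source|level|status reg): (\S+)")


def parse_pmc_reset(lines):
    """Return a dict with whichever of 'source', 'level', 'status reg' were found."""
    info = {}
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            info[match.group(1)] = match.group(2)
    return info


# The three lines quoted in the post above.
LOG = [
    "May 12 17:37:22 nvidia-desktop kernel: [    0.445992] tegra-pmc: ### PMC reset source: TEGRA_SOFTWARE_RESET",
    "May 12 17:37:22 nvidia-desktop kernel: [    0.445997] tegra-pmc: ### PMC reset level: TEGRA_RESET_LEVEL_WARM",
    "May 12 17:37:22 nvidia-desktop kernel: [    0.446001] tegra-pmc: ### PMC reset status reg: 0x3",
]

if __name__ == "__main__":
    print(parse_pmc_reset(LOG))
```

A boot whose `source` is `TEGRA_SOFTWARE_RESET` can then be flagged for a closer look at the log just before the reboot.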

TakenoriSato · May 12, 2021, 9:48am

Thanks for your explanation. Now it sounds like the first thing I need to check.

WayneWWW · May 12, 2021, 9:59am

Did you dump the log when the error happened, or did you just give me a normal boot-up log?

TakenoriSato · May 12, 2021, 10:18am

Of course, it is when this issue happened as I described above.

WayneWWW · May 12, 2021, 10:47am

Is there an application running when you hit this problem?

WayneWWW · May 12, 2021, 10:50am

If there is an application that reproduces this issue, please also check whether a similar application triggers it on the devkit.

If there is no such application, please check whether the default pinmux makes your board hit this problem.

WayneWWW · May 12, 2021, 10:53am

Also, please check whether moving to rel-32.5 resolves this issue. I think this is the first thing to try.

TakenoriSato · May 13, 2021, 12:35am

No. This does not depend on an application but on a particular device. It is common that all of these affected devices are our LTE models.

No, as said above.

I got the confirmation. Our Nano model just uses the default pinmux spreadsheet as is to produce dtb files. So, this should not be an issue.

Now I suspect that the radio waves from the LTE module somehow affect the signals electromagnetically under a particular condition, for example when the radio power is higher.

For stability, we stay on an earlier (stable) version, so we have an r32.4.4-based image. I would lose this reproducible environment, but I could try it to see whether this issue gets resolved.

linuxdev · May 13, 2021, 6:37pm

I would test that by wrapping the non-LTE component in grounded foil or other metal. If this is on something like an m.2 mount, then you could perhaps sandwich grounded foil between two very thin cardboard insulators and have at least a partial degree of RF separation.

TakenoriSato · May 14, 2021, 1:22am

The box with r32.4.4 has kept running for more than 15 hours. This box used to get rebooted 5~7 times a day. So, this looks very promising.

TakenoriSato · May 14, 2021, 1:30am

Thanks for your suggestion! I was wondering how I could test this effectively.

Yes, it is like M.2: the LTE module is mounted on a mini PCIe socket. So, I think I can wrap the whole board except for the LTE module with aluminum foil.

If the upgrade to r32.4.4 doesn’t solve this issue, I will try this.

TakenoriSato · May 20, 2021, 1:02am

I have not seen any such reboot for about a week. So, I am marking this as resolved by upgrading to L4T 32.4.4.

WayneWWW · May 20, 2021, 3:25am

There is a patch for this GPU issue that was merged into rel-32.4.4. That is why I suggested moving to rel-32.4.4 or later.
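To audit a fleet for boxes still below that release, a check along these lines may help. It is a minimal sketch that assumes the release string follows the `# R32 (release), REVISION: 4.4, ...` form found in `/etc/nv_tegra_release`; the function names and the sample strings are illustrative.

```python
import re

# Assumed format: "# R32 (release), REVISION: 4.4, ..."
RELEASE_RE = re.compile(r"R(\d+) \(release\), REVISION: (\d+)\.(\d+)")


def parse_l4t_version(text):
    """Return the L4T version as a (major, minor, patch) tuple of ints."""
    match = RELEASE_RE.search(text)
    if match is None:
        raise ValueError("unrecognized release string")
    return tuple(int(group) for group in match.groups())


def has_gpu_fix(version, minimum=(32, 4, 4)):
    """True if the version is at or above the release named in this thread."""
    return version >= minimum


if __name__ == "__main__":
    # On a device, the text would come from /etc/nv_tegra_release instead.
    for sample in ("# R32 (release), REVISION: 3.1", "# R32 (release), REVISION: 4.4"):
        version = parse_l4t_version(sample)
        print(version, has_gpu_fix(version))
```

Tuple comparison keeps the check correct for later releases such as 32.5.x without any string handling.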

TakenoriSato · May 20, 2021, 3:41am

Oh, thanks for the confirmation.

We haven’t seen such an issue on TX2 or Xavier NX. But could it happen on other Jetson platforms, or only on Nano?

WayneWWW · May 20, 2021, 3:47am

The TX2 and NX/Xavier use a different GPU architecture. The driver is also different.

Nano uses the T210 SoC, while TX2 and Xavier use the T186 and T194 SoCs, respectively.
