GPU has disappeared from bus

Hello,

We have found cases in the field where some Nano boxes rebooted unexpectedly due to a software reset.

terminal emulator log

On one of these devices, we found the cause (or at least one of the causes), shown below.

As you can see, it says “GPU has disappeared from bus!!”.

There are a couple of similar posts in the forum, but they were about desktop GPUs, not Jetson. In any case, they concerned GPU temperature, so I double-checked the GPU temperature: it was well below the threshold, at 34 degrees Celsius in the most recent case (as reported by tegrastats).
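For anyone who wants to log the temperature over time rather than read it by eye, here is a minimal sketch that pulls the `NAME@valueC` temperature fields out of one line of tegrastats output. The sample line is hypothetical (it is not from the affected device), and the exact set of fields varies between L4T releases, so treat the regular expression as an assumption to verify against your own output.

```python
import re
from typing import Dict

# Matches temperature fields such as "GPU@34C" or "CPU@36.5C".
# Frequency fields like "10%@1479" are skipped: they have no word
# character directly before "@" and no trailing "C".
TEMP_FIELD = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C")


def parse_temperatures(line: str) -> Dict[str, float]:
    """Return {sensor_name: degrees_celsius} for one tegrastats line."""
    return {name: float(value) for name, value in TEMP_FIELD.findall(line)}


# Hypothetical sample line, for illustration only.
sample = (
    "RAM 1500/3956MB CPU [10%@1479,5%@1479,off,off] "
    "PLL@33C CPU@36.5C PMIC@100C GPU@34C AO@41.5C thermal@35.25C"
)
temps = parse_temperatures(sample)
print(temps["GPU"])
```

Feeding each line of a saved tegrastats log through `parse_temperatures` would show whether the GPU reading rises shortly before a reboot.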

L4T is r32.3.1, running on a Nano carrier board, with our own LTE module (SIMCOM/Telit).

I would appreciate any pointers on what we should check, especially from a hardware design point of view.

Thanks!

This is probably not the issue, but if the device tree were wrong, the GPU might disappear. If you’ve made any device tree changes, you might check for a conflict with the GPU.

FYI, you are correct that the desktop GPU posts do not apply: desktop GPUs are PCI, but the Jetson GPU is integrated directly with the memory controller. Had the GPU been PCI, the device tree would not have been required to set up parts of the GPU (PCI allows the device to be queried, but devices that cannot reply to requests for their details need a device tree).

Thanks for your comment, @linuxdev

Yes, we use our own pinmux configuration. But do you think the GPU could disappear only under a particular (unknown) condition? I will check our own pinmux spreadsheet.

Please follow this post to share the necessary info. In particular, share a text log instead of pictures.

You’ll need to add the content mentioned by @WayneWWW, but yes, an error in the device tree can make one device appear and another disappear. It depends on the conflict. Specs are only written for how things work when not in error, so there is no way to define how a device should behave under odd device tree error conditions. As a contrived example: if memory regions are reserved and two devices are not intended to operate in the same memory region, but both do, then there is no telling how one will behave as the other puts all of its init into that region (and the device tree might determine this).
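To make the contrived example concrete, here is a minimal sketch of an overlap check between two memory regions, each given as a (base address, size) pair. The addresses and sizes are made up for illustration and are not taken from any real Jetson device tree; the function name is also hypothetical.

```python
from typing import Tuple

Region = Tuple[int, int]  # (base address, size in bytes)


def regions_overlap(a: Region, b: Region) -> bool:
    """Return True if the two half-open ranges [base, base + size) intersect."""
    a_base, a_size = a
    b_base, b_size = b
    return a_base < b_base + b_size and b_base < a_base + a_size


# Hypothetical regions, not real device tree values.
device_a = (0x80000000, 0x01000000)  # 16 MiB starting at 0x80000000
device_b = (0x80800000, 0x00100000)  # 1 MiB that falls inside device_a
device_c = (0x81000000, 0x00100000)  # starts exactly where device_a ends

print(regions_overlap(device_a, device_b))
print(regions_overlap(device_a, device_c))
```

Here `device_a` and `device_b` conflict, while `device_c` merely touches the end of `device_a` and does not overlap it.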

Oops, my bad.

Collected today. 3 files (dmesg, uart, syslog) are attached.

This is from when I set up the UART console.

May 12 09:01:50 nvidia-desktop kernel: [    1.878472] KERNEL: PMC reset status reg: 0x0
May 12 09:01:50 nvidia-desktop kernel: [    1.878548] BL: PMC reset status reg: 0x0
May 12 09:01:50 nvidia-desktop kernel: [    1.878550] BL: PMIC poweroff Event Recorder: 0x50

And this is from when the box rebooted after the GPU disappeared and the UART console captured the message.

May 12 17:37:22 nvidia-desktop kernel: [    0.445992] tegra-pmc: ### PMC reset source: TEGRA_SOFTWARE_RESET
May 12 17:37:22 nvidia-desktop kernel: [    0.445997] tegra-pmc: ### PMC reset level: TEGRA_RESET_LEVEL_WARM
May 12 17:37:22 nvidia-desktop kernel: [    0.446001] tegra-pmc: ### PMC reset status reg: 0x3
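As a rough way to count how often this happens, here is a minimal sketch that scans log lines for the `tegra-pmc` reset source and for the "GPU has disappeared from bus" message. The reset-source line is the one quoted above; the GPU line in the sample is a hypothetical stand-in, since only the message text itself is known from this thread, and the function name is an assumption for illustration.

```python
import re
from typing import Dict, Iterable, Tuple

RESET_SOURCE = re.compile(r"tegra-pmc: ### PMC reset source: (\w+)")
GPU_LOST = "GPU has disappeared from bus"


def summarize(lines: Iterable[str]) -> Tuple[Dict[str, int], int]:
    """Count PMC reset sources and GPU-disappeared messages in log lines."""
    sources: Dict[str, int] = {}
    gpu_lost = 0
    for line in lines:
        match = RESET_SOURCE.search(line)
        if match:
            name = match.group(1)
            sources[name] = sources.get(name, 0) + 1
        if GPU_LOST in line:
            gpu_lost += 1
    return sources, gpu_lost


sample_lines = [
    # Hypothetical line: only the message text is taken from this thread.
    "May 12 17:36:58 nvidia-desktop kernel: GPU has disappeared from bus!!",
    # Quoted from the syslog excerpt above.
    "May 12 17:37:22 nvidia-desktop kernel: [    0.445992] tegra-pmc: "
    "### PMC reset source: TEGRA_SOFTWARE_RESET",
]
sources, gpu_lost = summarize(sample_lines)
print(sources, gpu_lost)
```

In practice the lines would come from something like `open("/var/log/syslog")`, which would let you compare the number of software resets against the number of GPU messages.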

gpu_disappeared.zip (711.5 KB)

I think other info was included in my first comment.

Thanks for your explanation. Now it sounds like the first thing I need to check.

Did you dump the log when the error happened, or did you just give me a normal boot-up log?

Of course, it is when this issue happened as I described above.

Is there an application running when you hit this problem?

If there is an application that reproduces this issue, please also check whether a similar application triggers the issue on the devkit.

If there is no such application, please check whether your board still hits this problem with the default pinmux.

Also, please check whether moving to rel-32.5 resolves this issue. I think this is the first thing to try.

No. This does not depend on an application but on particular devices. What all of the affected devices have in common is that they are our LTE models.

No, as said above.

I got confirmation. Our Nano model just uses the default pinmux spreadsheet as is to produce the dtb files. So, this should not be an issue.

Now I suspect that radio emissions from the LTE module electromagnetically interfere with the signal under particular conditions, for example when the transmit power is higher.

For the stability we require, we use an earlier (stable) release rather than the latest, so we have an r32.4.4-based image. I would lose this reproducible environment, but I could try it to see whether the issue gets resolved.

I would test that by wrapping the non-LTE component in grounded foil or other metal. If this is on something like an m.2 mount, then you could perhaps sandwich grounded foil between two very thin cardboard insulators and have at least a partial degree of RF separation.

The box with r32.4.4 has been running for more than 15 hours. This box used to reboot 5 to 7 times a day. So, this looks very promising.

Thanks for your suggestion! I was wondering how I could test this effectively.

Yes, it is similar to M.2: the LTE module is mounted on a mini PCIe socket. So, I think I can wrap the whole board except for the LTE module in aluminum foil.

If the upgrade to r32.4.4 does not solve this issue, I will try this.

I have not seen any such reboot for about a week. So, I am marking this as resolved by upgrading to L4T 32.4.4.

There is a patch for this GPU issue that was merged into rel-32.4.4. That is why I suggested moving to rel-32.4.4 or later.

Oh, thanks for the confirmation.

We haven’t seen such an issue on TX2 or Xavier NX. But could it happen on other Jetson platforms, or only on Nano?

The TX2 and Xavier/NX use a different GPU architecture. The driver is also different.

Nano uses the T210 SoC, while TX2 and Xavier use the T186 and T194 SoCs, respectively.