Non-deterministic kernel panic due to bug in tegra19x_cbb.c

Hello,

When trying to run a new L4T 34.1.1 based image (with Kernel 5.10+) with our custom .cfg (on a custom carrier board), we get seemingly random kernel panics during boot-up.

This happens in approx. 30-40% of all boot attempts, and always in conjunction with CBB-NOC@0x2300000,irq=15 (see logs).

There were no such problems with the corresponding L4T 32.5.1 based image, and we made no modifications to the config items besides aligning our config to the new jetson-xavier-devkit.cfg.

As already mentioned in "No support for t194 SOC Family in tegra-defconfig with L4T 34.1.x?", it seems that the kernel config option for tegra19x does not work.
Nevertheless, the moderator assured us that tegra19x support is available with the default kernel config (in which the corresponding option for tegra19x is not selected!!!), and thus we tried with tegra_defconfig.

Now it seems clear that there is either a major issue with tegra19x support (which could lead to such severe behaviour) or some general, serious bug at the IRQ level in other parts of the defconfig.

A short excerpt of the log part for the panic:

[   22.860604] CPU:0, Error:CBB-NOC@0x2300000,irq=15
[   22.870756] **************************************
[   22.870937] * For more Internal Decode Help
[   22.871145] *     http://nv/cbberr
[   22.871268] * NVIDIA userID is required to access
[   22.871442] **************************************
[   22.871624] CPU:0, Error:CBB-NOC
[   22.871781] 	Error Logger		: 0
[   22.871958] 	ErrLog0			: 0x80030000
[   22.872092] 	  Transaction Type	: RD  - Read, Incrementing
[   22.872283] 	  Error Code		: SLV
[   22.872418] 	  Error Source		: Target
[   22.873558] 	  Error Description	: Target error detected by CBB slave
[   22.880151] 	  AXI2APB_4 bridge error: RDFIFOF - Read Response FIFO Full interrupt
[   22.887742] 	  Packet header Lock	: 0
[   22.891232] 	  Packet header Len1	: 3
[   22.894905] 	  NOC protocol version	: version >= 2.7
[   22.899890] 	ErrLog1			: 0x351626
[   22.903130] 	ErrLog2			: 0x0
[   22.905755] 	  RouteId		: 0x351626
[   22.909176] 	  InitFlow		: ccroc_p2ps/I/ccroc_p2ps
[   22.914413] 	  Targflow		: host1x_p2pm/T/host1x_p2pm
[   22.918968] 	  TargSubRange		: 11
[   22.922462] 	  SeqId			: 0
[   22.925265] 	ErrLog3			: 0x30124
[   22.928536] 	ErrLog4			: 0x0
[   22.931143] 	  Address accessed	: 0x155f0124
[   22.935680] 	ErrLog5			: 0x989f851
[   22.939088] 	  Non-Modify		: 0x1
[   22.942501] 	  AXI ID		: 0x13
[   22.945656] 	  Master ID		: CCPLEX
[   22.948722] 	  Security Group(GRPSEC): 0x7e
[   22.952925] 	  Cache			: 0x1 -- Bufferable 
[   22.957379] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[   22.964204] 	  FALCONSEC		: 0x0
[   22.967382] 	  Virtual Queuing Channel(VQC): 0x0
[   22.971995] 	**************************************
[   22.976807] ------------[ cut here ]------------
[   22.981533] kernel BUG at /Linux_for_Tegra/l4t_new/sources/kernel/nvidia/drivers/platform/tegra/cbb/tegra19x_cbb.c:553!
[   22.992302] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   22.997637] Modules linked in:
[   23.000539] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.65-tegra #2
[   23.007164] Hardware name: Unknown NVIDIA Jetson Xavier NX Developer Kit/NVIDIA Jetson Xavier NX Developer Kit, BIOS r34.1-975eef6 05/16/2022
[   23.019439] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[   23.025510] pc : tegra194_cbb_error_isr+0x198/0x1b0
[   23.030744] lr : tegra194_cbb_error_isr+0xcc/0x1b0
[   23.035432] sp : ffff800010003df0
[   23.039042] x29: ffff800010003df0 x28: 0000000000000001 
[   23.044189] x27: ffffcad10588c598 x26: 0000000000000080 
[   23.049605] x25: ffffcad1061c9420 x24: 0000000000000001 
[   23.055378] x23: ffffcad105ad6000 x22: 000000000000000f 
[   23.060456] x21: ffffcad106044e60 x20: 0000000000000005 
[   23.066230] x19: ffffcad106044f38 x18: 0000000000000010 
[   23.071397] x17: ffffcad104d1d970 x16: 0000000000000068 
[   23.077169] x15: ffffcad105dc2b30 x14: 0720072007200720 
[   23.082421] x13: 0720072007200720 x12: 0720072007200720 
[   23.088108] x11: 0720072007200720 x10: 0720072007200720 
[   23.093371] x9 : 0720072007200720 x8 : 07200720072a072a 
[   23.099053] x7 : 072a072a072a072a x6 : c0000000ffffefff 
[   23.104120] x5 : 0000000000057fa8 x4 : ffffcad105dd78a0 
[   23.109549] x3 : 00000000ffffffff x2 : ffffcad1042a6d10 
[   23.115138] x1 : ffffcad105dc25c0 x0 : 0000000100010001 
[   23.120481] Call trace:
[   23.122960]  tegra194_cbb_error_isr+0x198/0x1b0
[   23.127502]  __handle_irq_event_percpu+0x60/0x2a0
[   23.132031]  handle_irq_event_percpu+0x3c/0xa0
[   23.136234]  handle_irq_event+0x4c/0xf0
[   23.140258]  handle_fasteoi_irq+0xbc/0x170
[   23.144318]  generic_handle_irq+0x3c/0x60
[   23.148302]  __handle_domain_irq+0x6c/0xc0
[   23.152512]  efi_header_end+0xa8/0xf0
[   23.155780]  el1_irq+0xd0/0x180
[   23.158734]  cpuidle_enter_state+0xb4/0x400
[   23.162925]  cpuidle_enter+0x3c/0x50
[   23.166455]  call_cpuidle+0x40/0x70
[   23.169921]  do_idle+0x1fc/0x260
[   23.173586]  cpu_startup_entry+0x2c/0x70
[   23.177116]  rest_init+0xd8/0xe4
[   23.180611]  arch_call_rest_init+0x14/0x1c
[   23.184816]  start_kernel+0x4c0/0x4f4
[   23.188153] Code: a9425bf5 a9446bf9 a8c67bfd d65f03c0 (d4210000) 
[   23.194300] ---[ end trace b8452aa19fc46e1e ]---
[   23.198715] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[   23.206494] SMP: stopping secondary CPUs
[   23.210450] Kernel Offset: 0x4ad0f40f0000 from 0xffff800010000000
[   23.216464] PHYS_OFFSET: 0xffffd2bec0000000
[   23.220585] CPU features: 0x8240002,03002a30
[   23.224816] Memory Limit: none
[   23.242394] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---

The second error message variant is:

[   22.828355] CPU:0, Error:CBB-NOC@0x2300000,irq=15
[   22.828593] **************************************
[   22.828779] * For more Internal Decode Help
[   22.828930] *     http://nv/cbberr
[   22.829076] * NVIDIA userID is required to access
[   22.829218] **************************************
[   22.829395] CPU:0, Error:CBB-NOC
[   22.829545] 	Error Logger		: 0
[   22.829685] 	ErrLog0			: 0x80030000
[   22.829820] 	  Transaction Type	: RD  - Read, Incrementing
[   22.833597] 	  Error Code		: SLV
[   22.836922] 	  Error Source		: Target
[   22.840760] 	  Error Description	: Target error detected by CBB slave
[   22.847087] 	  AXI2APB_4 bridge error: RDFIFOF - Read Response FIFO Full interrupt
[   22.854681] 	  Packet header Lock	: 0
[   22.858172] 	  Packet header Len1	: 3
[   22.861891] 	  NOC protocol version	: version >= 2.7
[   22.866586] 	ErrLog1			: 0x35162c
[   22.869815] 	ErrLog2			: 0x0
[   22.872959] 	  RouteId		: 0x35162c
[   22.876118] 	  InitFlow		: ccroc_p2ps/I/ccroc_p2ps
[   22.881359] 	  Targflow		: host1x_p2pm/T/host1x_p2pm
[   22.885912] 	  TargSubRange		: 11
[   22.889411] 	  SeqId			: 0
[   22.891949] 	ErrLog3			: 0x30124
[   22.895198] 	ErrLog4			: 0x0
[   22.898372] 	  Address accessed	: 0x155f0124
[   22.902546] 	ErrLog5			: 0xb09f851
[   22.906040] 	  Non-Modify		: 0x1
[   22.909185] 	  AXI ID		: 0x16
[   22.912347] 	  Master ID		: CCPLEX
[   22.915668] 	  Security Group(GRPSEC): 0x7e
[   22.919896] 	  Cache			: 0x1 -- Bufferable 
[   22.924114] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[   22.930892] 	  FALCONSEC		: 0x0
[   22.934050] 	  Virtual Queuing Channel(VQC): 0x0
[   22.938678] 	**************************************
[   22.943802] ------------[ cut here ]------------
[   22.948227] kernel BUG at /Linux_for_Tegra/l4t_new/sources/kernel/nvidia/drivers/platform/tegra/cbb/tegra19x_cbb.c:553!
[   22.958991] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   22.964583] Modules linked in:
[   22.967758] CPU: 0 PID: 121 Comm: kworker/0:3 Not tainted 5.10.65-tegra #2
[   22.974384] Hardware name: Unknown NVIDIA Jetson Xavier NX Developer Kit/NVIDIA Jetson Xavier NX Developer Kit, BIOS r34.1-975eef6 05/16/2022
[   22.987219] Workqueue: events deferred_probe_work_func
[   22.992404] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[   22.998212] pc : tegra194_cbb_error_isr+0x198/0x1b0
[   23.003471] lr : tegra194_cbb_error_isr+0xcc/0x1b0
[   23.008322] sp : ffff800010003b40
[   23.011396] x29: ffff800010003b40 x28: 0000000000000001 
[   23.016820] x27: ffffb2b7ee88c598 x26: 0000000000000080 
[   23.022585] x25: ffffb2b7ef1c9420 x24: 0000000000000001 
[   23.027922] x23: ffffb2b7eead6000 x22: 000000000000000f 
[   23.033181] x21: ffffb2b7ef044e60 x20: 0000000000000005 
[   23.038857] x19: ffffb2b7ef044f38 x18: 0000000000000010 
[   23.043951] x17: 0000000000000007 x16: 0000000000000068 
[   23.049455] x15: ffff2237921abf70 x14: 0720072007200720 
[   23.054882] x13: 0720072007200720 x12: 0720072007200720 
[   23.060303] x11: 0720072007200720 x10: 0720072007200720 
[   23.066078] x9 : 0720072007200720 x8 : 07200720072a072a 
[   23.071241] x7 : 072a072a072a072a x6 : c0000000ffffefff 
[   23.076842] x5 : 0000000000057fa8 x4 : ffffb2b7eedd78a0 
[   23.082002] x3 : 00000000ffffffff x2 : ffffb2b7ed2a6d10 
[   23.087599] x1 : ffff2237921aba00 x0 : 0000000100010100 
[   23.092934] Call trace:
[   23.095396]  tegra194_cbb_error_isr+0x198/0x1b0
[   23.099953]  __handle_irq_event_percpu+0x60/0x2a0
[   23.104488]  handle_irq_event_percpu+0x3c/0xa0
[   23.108723]  handle_irq_event+0x4c/0xf0
[   23.112715]  handle_fasteoi_irq+0xbc/0x170
[   23.116773]  generic_handle_irq+0x3c/0x60
[   23.120764]  __handle_domain_irq+0x6c/0xc0
[   23.124967]  efi_header_end+0xa8/0xf0
[   23.128471]  el1_irq+0xd0/0x180
[   23.131432]  __do_softirq+0xb0/0x3e0
[   23.135120]  irq_exit+0xc0/0xe0
[   23.138087]  __handle_domain_irq+0x70/0xc0
[   23.141857]  efi_header_end+0xa8/0xf0
[   23.145566]  el1_irq+0xd0/0x180
[   23.148527]  tegra_dpaux_update.isra.0+0x30/0x50
[   23.153060]  tegra_dpaux_pinctrl_set_mux+0xf8/0x130
[   23.158133]  pinmux_enable_setting+0x10c/0x280
[   23.162422]  pinctrl_commit_state+0x94/0x170
[   23.166722]  pinctrl_select_state+0x34/0x50
[   23.170916]  pinctrl_bind_pins+0x14c/0x160
[   23.175141]  really_probe+0x8c/0x3d0
[   23.178691]  driver_probe_device+0x5c/0xc0
[   23.182829]  __device_attach_driver+0x88/0xc0
[   23.186832]  bus_for_each_drv+0x88/0xe0
[   23.190937]  __device_attach+0xf0/0x150
[   23.194621]  device_initial_probe+0x24/0x30
[   23.198817]  bus_probe_device+0x9c/0xb0
[   23.202956]  deferred_probe_work_func+0x88/0xc0
[   23.207571]  process_one_work+0x1c0/0x4a0
[   23.211789]  worker_thread+0x1f8/0x420
[   23.215444]  kthread+0x148/0x170
[   23.219035]  ret_from_fork+0x10/0x18
[   23.222739] Code: a9425bf5 a9446bf9 a8c67bfd d65f03c0 (d4210000) 
[   23.228948] ---[ end trace 2f85a3e4cbe9a331 ]---
[   23.233470] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[   23.240822] SMP: stopping secondary CPUs
[   23.244930] Kernel Offset: 0x32b7dd0f0000 from 0xffff800010000000
[   23.250792] PHYS_OFFSET: 0xffffddcac0000000
[   23.254928] CPU features: 0x8240002,03002a30
[   23.259535] Memory Limit: none
[   23.278059] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---

Note the different call trace (CPU_Idle vs. dpaux).
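
To compare the two variants, the decoded CBB fields can be pulled out of a saved boot log. The following is a minimal Python sketch, assuming the line format of the excerpts above; the function name `parse_cbb_errors` and the selection of fields are illustrative and not part of any NVIDIA tool.

```python
import re

# Matches decoded lines such as "[   22.931143] \t  Address accessed\t: 0x155f0124".
FIELD_RE = re.compile(r"^\[\s*\d+\.\d+\]\s+(?P<key>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")

# Fields worth comparing between panics (an arbitrary selection for this sketch).
WANTED = ("Error Code", "InitFlow", "Targflow", "Address accessed", "Master ID")


def parse_cbb_errors(log_text):
    """Return one dict of decoded fields per CBB error report in a kernel log."""
    reports = []
    current = None
    for line in log_text.splitlines():
        if "Error:CBB-NOC@" in line:
            # "CPU:0, Error:CBB-NOC@0x2300000,irq=15" opens a new report.
            current = {}
            reports.append(current)
            continue
        if current is None:
            continue
        match = FIELD_RE.match(line)
        if match and match.group("key") in WANTED:
            current[match.group("key")] = match.group("value")
    return reports


# A few lines copied from the first excerpt above.
SAMPLE = "\n".join([
    "[   22.860604] CPU:0, Error:CBB-NOC@0x2300000,irq=15",
    "[   22.872418] \t  Error Code\t\t: SLV",
    "[   22.914413] \t  Targflow\t\t: host1x_p2pm/T/host1x_p2pm",
    "[   22.931143] \t  Address accessed\t: 0x155f0124",
    "[   22.945656] \t  Master ID\t\t: CCPLEX",
])

for report in parse_cbb_errors(SAMPLE):
    print(report)
```

In both excerpts above, the accessed address (0x155f0124) and the target flow (host1x_p2pm) are identical, even though the call traces differ.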

The issue also occurs with the default L4T SD card image and/or the L4T DevKit default flash config (jetson-xavier-nx-devkit.cfg), so it is probably not an issue with our config.

Any ideas on that? Is the problem known? How can we get more information on what is happening (especially regarding irq15)? Or is Jetson Xavier NX in fact not yet fully supported?

This is very urgent because our customer needs support for Kernel 5 as fast as possible and cannot continue their business if it won’t work under the new kernel.

nosscall+sdmmc.txt (98.1 KB)
nosscall+nosdmmc+eepromoff.txt (387.9 KB)

At this moment I can only say

  1. Please be aware that JetPack 5.0.1 is still a DP (“developer preview”) version. I don’t think it is a good idea to let your customer use a preview version for production. This code branch is already deprecated.

  2. We don’t see other customers reporting this error on their NX + JetPack 5. I don’t know what you did to the “tegra19x” option in your kernel config. Since JetPack/SDK Manager already supports NX, we are sure this software can work on the NX devkit.

  3. Since your board is not a devkit, I would suggest porting the changes for your board one by one to see which one is causing the problem.
    I don’t know your board, so I am unlikely to be able to say directly what goes wrong here. Analyzing the CBB error also does not help, since its log only gives a very general error. It won’t tell which driver or device might be wrong. For your case, you can check whether it always happens while USB is probing.

  1. We don’t just “let them use” it; they are aware of the DP state. But due to an issue with the WIFI modules that exists in the 4.9.x kernel used by JetPack versions < 5, we are all eagerly waiting for a kernel 5 in the NVIDIA BSP.
    I mean, kernel.org released the 5.0 version back in 2019! We did not expect to wait such a long time for a 3-year-old kernel!
  2. As I already mentioned in the linked post, no modifications were made to the tegra_defconfig. The kernel options in question can also be found there. So, let me rephrase the question: what does CONFIG_ARCH_TEGRA_19x_SOC do if it is not needed for Xavier NX SOCs?
  3. We already ported our board changes successfully (besides the WIFI issue, see above) with JP4 (L4T 32.5.1), so it seems strange to repeat this for every new major JP release!

Regarding the USB hint, we also noticed this coincidence in timing: the kernel panics each time right after some USB enumerations. This happens regardless of the USB controller (SoC-integrated or attached via PCI).

We think it is probably something related to the disabled SSC on one PCI channel (CH5), because the image works on the DevKit with SSC enabled.
In this post, the moderator hints that disabling SSC on plle can lead to USB issues.
We did not deactivate it for plle (CH4) but for pllnvhs (CH5), so we did not expect to get into trouble with USB.

Besides these two clocks, the logs show several failures during clock assignments and MEMIO mappings for certain components. These also occur with the devkit and the default image, so we are not sure what the current status of the DP is regarding clocks, mem and thus tegra19x support.

Which knobs can we turn to adjust the timing behaviour so that we can avoid the panic?
It seems that the panic occurs more often if the DPAUX write errors occur in early boot.
Can it be the cpu-boost option in the USB dtb tree? Or something else we have not thought of…?

Hi,

If you think it comes from USB, how about just disabling USB in the device tree first to see whether the panic goes away?

How do we disable it correctly? Is it enough to deactivate the ports and lanes in xusb_padctl, or do we have to disable the whole xhci stuff?

To make it clear: we have done everything we know over the last weeks to fix or analyse this error, so “just try to disable it” sounds as if you think I just shout out every error we get without trying to fix it ourselves!
This has been the case in all discussions we had in the past!!!

So, how do we disable USB via the dtb?

The following procedure does not work:
sudo fdtput -t i /boot/kernel_tegra194-p3668-0001-ac10x-0000.dtb /xhci@3610000 status "disabled"

After reboot:

jan@jan-jetson-dbg:~$ sudo fdtget /boot/kernel_tegra194-p3668-0001-ac10x-0000.dtb /xhci@3610000 status
65535

but all usb devices are still discovered:

jan@jan-jetson-dbg:~$ lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 046a:0011 Cherry GmbH G83 (RS 6000) Keyboard
Bus 001 Device 002: ID 046d:c045 Logitech, Inc. Optical Mouse
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Hi,

You can try to disable all lanes/ports of the usb, xhci and xudc nodes. For more detail, you can check the adaptation guide.

The USB part is the same on Orin and NX.

BTW, you can use /proc/device-tree to check whether your dtb change is really taking effect. If it is not, you need to check whether your method of updating the dtb is correct.
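
A minimal Python sketch of that check, assuming the node path `/xhci@3610000` from the `fdtget` call above also exists in the live tree; the helper name `node_status` is hypothetical. It reads the `status` property as the running kernel sees it, which is independent of whichever DTB file was edited under /boot. Note also that `status` is a string property, and `fdtput` writes strings with `-t s`, whereas the attempt above used `-t i`; whether that alone explains the result is not confirmed in this thread.

```python
from pathlib import Path


def node_status(node_path, root="/proc/device-tree"):
    """Return the effective status of a node in the live device tree.

    Returns None if the node does not exist. Linux treats a node without a
    "status" property as enabled, so "okay" is returned in that case.
    """
    node = Path(root) / node_path.lstrip("/")
    if not node.is_dir():
        return None
    status_file = node / "status"
    if not status_file.is_file():
        return "okay"
    # String properties are exposed as NUL-terminated byte strings.
    return status_file.read_bytes().rstrip(b"\x00").decode("ascii", "replace")


# Node path taken from the fdtget command above; prints None where it is absent.
print(node_status("/xhci@3610000"))
```

If this still reports `okay` after a reboot, the edited DTB is most likely not the one being loaded, or the property was not written as a string.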

Please be aware that I don’t know you or your custom board. You cannot just throw some errors at me and ask for a solution. This is not possible.
What I can suggest for now is to disable the USB ports first and see whether the issue is still there.

If the issue is gone, then we can dig into USB to see why it is causing the CPU hang.

OK, I deactivated USB with the following DTS additions:

&xusb_padctl {
	ports {
		usb2-0 {
			status = "disabled";
		};
		usb2-1 {
			status = "disabled";
		};
		usb2-2 {
			status = "disabled";
		};
		usb3-2 {
			status = "disabled";
		};
	};
	
	pads {
		usb2 {
			lanes {
				usb2-0 {
					status = "disabled";
				};
				usb2-1 {
					status = "disabled";
				};
				usb2-2 {
					status = "disabled";
				};
			};
		};
		usb3 {
			lanes {
				usb3-2 {
					status = "disabled";
				};
			};
		};
	};
};

&tegra_xhci {
	status = "disabled";
};

&tegra_xudc {
	status = "disabled";
};

The mods seem to disable USB successfully, but the issue is not gone (see log).

How should we proceed now? I guess we have to find the resource for which the interrupt was requested; the stack trace gives no hint about this.

nosscplle+nousb.txt (49.6 KB)

Hi,

Could you share your full dts here? Also, how many displays are in use on your carrier board?

We tested with one display and with no display connected.
The error that appears in the logs is similar to the one in this post, so HPD is not working correctly.

ac10x.dts (305.0 KB)

Could you disable the dpaux nodes which are not in use?

…done. Now the stack trace seems to be different, but the panic remains (see log below).

nousb+nodpaux-def2.txt (55.1 KB)

The log above does not show the DP-AUX disabled message; here is another boot log with the same cfg that does include it:
nousb+nodpaux-def.txt (75.6 KB)

Hi,

Could you share what the current device tree looks like?

ac10x-nodpaux.dts (305.0 KB)

I am not sure about your purpose.

Why do you still leave dpaux@155F0000 enabled there? There is no display using that either.

OK, my attention was only on dpaux0 & 1 because I assumed that the head & sor which use it are disabled.

After disabling all other dpaux nodes, the system boots up without kernel panic.
nousb+nodpaux0-3-def.txt (64.0 KB)

Now it takes very long until the OEM Setup appears, because i2c timeout messages spam the log…
So, what is happening here? A misconfiguration of the i2c channels which are used for dpaux?

The problem is that the dpaux@155F0000 registers are somehow being accessed.

Now it is gone after disabling this unused dpaux node.
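
The link between the log and this node can be checked mechanically: the "Address accessed" value in both panics (0x155f0124) lies just above the unit address of dpaux@155F0000. The following Python sketch walks the live device tree and returns the node whose unit address is the closest one at or below a faulting address, together with its status. It is only a heuristic under stated assumptions: it ignores the size in `reg` and any address translation through `ranges`, and the function name `find_owner_node` is hypothetical.

```python
import os


def find_owner_node(fault_addr, root="/proc/device-tree"):
    """Return (path, status) of the node with the closest unit address <= fault_addr.

    Heuristic only: node sizes and bus address translation are not considered.
    Returns None if no node with a parsable unit address qualifies.
    """
    best = None  # (base address, node path)
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            _, sep, unit = name.partition("@")
            if not sep:
                continue
            try:
                base = int(unit, 16)
            except ValueError:
                continue  # unit addresses such as "1,0" are skipped
            if base <= fault_addr and (best is None or base > best[0]):
                best = (base, os.path.join(dirpath, name))
    if best is None:
        return None
    status_file = os.path.join(best[1], "status")
    if os.path.isfile(status_file):
        with open(status_file, "rb") as f:
            status = f.read().rstrip(b"\x00").decode("ascii", "replace")
    else:
        status = "okay"  # Linux treats a missing status property as enabled
    return best[1], status


# Address taken from the "Address accessed" line of the panic logs above.
print(find_owner_node(0x155F0124))
```

On a board showing this panic, a result that names a dpaux node with status `okay` would point at the same unused node that had to be disabled here.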