Xavier NX R35.6.0 Kernel Oops

Hi NVIDIA team,

Do you have any updates on this? Could you reproduce the issue or is there anything else I can help with?

Hi,

We tried the stress test you provided on rel-35.6, but the issue does not seem easy to reproduce.

How long did it take, or how many attempts did you need, to reproduce this?

Hi WayneWWW,

Apologies for the late reply; we have been busy trying to get a release out with the compatibility serial drivers.

I have had a chance to replicate the test again a few times.

Electrical Connection
For these tests, I connect:

  • A Micro-USB cable from the dev kit to the host.
  • An Ethernet cable from the dev kit to our network (only used for downloading a few apt packages).
  • The 12V barrel jack connector.
  • The USB ↔ serial adaptor on the UART RX/UART TX pins.
  • Some jumper cables to allow me to trigger SYS RESET / RESET FC.

Build/Flash OS

This time I did not use a customised filesystem layout, and I used the nvsdkmanager_flash.sh tool to flash, rather than initrd_flash.sh.

I start with unpacking the Jetson Linux / sample rootfs, then run apply_binaries.sh:

I then set up the default user:

Then I put the board into recovery mode with the jumpers, and flash with nvsdkmanager_flash.sh:

Once flashing has completed, I log in via SSH and install a few packages (stress-ng, nano, python3-serial):

Next, I modify the extlinux.conf file to include “slub_debug=FZP” at the end of the kernel command line, to make out-of-bounds writes easier to detect:
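The command used for this edit is not shown in the post. As a minimal sketch, assuming the stock layout where the kernel arguments sit on the APPEND line of /boot/extlinux/extlinux.conf, the change could be made with a small helper like the one below. The function name add_kernel_arg is hypothetical; it only transforms text and does not touch the file itself.

```python
def add_kernel_arg(conf_text: str, arg: str = "slub_debug=FZP") -> str:
    """Append `arg` to every active APPEND line that does not already have it.

    Assumption: kernel arguments live on lines starting with "APPEND",
    as in a stock extlinux.conf. Commented-out lines are left untouched.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_append = stripped.upper().startswith("APPEND ")
        if is_append and arg not in stripped.split():
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"
```

To apply it on the target, read the file, pass its contents through add_kernel_arg, and write the result back as root. Keeping a backup copy of the original file first is prudent, since a broken extlinux.conf can prevent booting.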

I then rebooted the system for the command line changes to take effect.

I then create serial_stressor.py on the system:

I then start two serial port stressors and one stress-ng stressor, and begin capturing the serial debug messages from the UART RX/UART TX pins:

When a fault is reported (a kernel oops or a slub_debug fault), this is printed to the UART TX pin and shows as follows:

Slub debug detected out of bounds write:

Kernel oops:

Test results
The first test hit a kernel oops within 15 minutes of running stress-ng and the two serial port stressors.

The second test hit a slub_debug report 18 minutes after starting the stressors, then a kernel oops after 1 hour of running them.

I hope this gives enough information to accurately reproduce the fault. It does take some time for the fault to develop, so perhaps it is worth leaving the system running for an hour or two to see the issue occur.

I would suggest also posting the output of “lsmod”.

In the Python code, you might get more detail if you separate the s.write() and encode() steps. It may also help to print something so you know which part of the loop last succeeded. For example, maybe it is really an encode() error. Presumably it is a write() error, but you don’t really know which without some debug output. A loop count in any debug output might be useful as well.
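The suggestion above can be sketched roughly as follows. The actual serial_stressor.py is not shown in this thread, so this is a hypothetical loop rather than the original script: stress_port, its parameters, and the message contents are all assumptions. The port argument is any object with a write(bytes) method, such as a serial.Serial instance from pyserial.

```python
def stress_port(port, message="0123456789" * 10, max_loops=None,
                log=print, log_every=1000):
    """Write `message` to `port` repeatedly, recording which step ran last.

    encode() and write() are kept as separate steps so that a failure
    can be attributed to one or the other, together with the loop count.
    """
    loops = 0
    while max_loops is None or loops < max_loops:
        step = "encode"
        try:
            payload = message.encode("ascii")
            step = "write"
            written = port.write(payload)
        except Exception as exc:
            # Report the failing step and loop index, then re-raise.
            log(f"loop {loops}: {step}() failed: {exc!r}")
            raise
        loops += 1
        if loops % log_every == 0:
            log(f"loop {loops}: last write() returned {written}")
    return loops
```

With pyserial this could be called as stress_port(serial.Serial("/dev/ttyTHS0", 115200)), where the device name and baud rate are assumptions to adjust for the port under test. A kernel oops will not necessarily surface as a Python exception, so the periodic progress line is what indicates how far the loop got before the fault.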

Hello,
I just managed to reproduce this on the development kit by following the post above.
It breaks whether I flash to the NVMe or to the eMMC.
First failure within ~7 minutes of running the tools:

[  431.219392] =============================================================================
[  431.219640] BUG kmalloc-256 (Tainted: G           O     ): Poison overwritten
[  431.219801] -----------------------------------------------------------------------------
[  431.219801] 
[  431.220001] Disabling lock debugging due to kernel taint
[  431.220132] INFO: 0x00000000c01cc8b2-0x000000005db402b7 @offset=7964. First byte 0x6 instead of 0x6b
[  431.220324] INFO: Slab 0x00000000672aece3 objects=21 used=21 fp=0x0000000000000000 flags=0x8000000000010200
[  431.220530] INFO: Object 0x00000000a13449be @offset=7936 fp=0x000000006ed1c27a
[  431.220530] 
[  431.220717] Redzone  0000000014ad01b9: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.220917] Redzone  00000000204ddab8: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.221115] Redzone  00000000b82ab49b: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.221708] Redzone  000000003573d9be: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.224838] Redzone  000000003bb9bb43: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.234550] Redzone  00000000b726c626: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.243831] Redzone  00000000500b4578: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.253626] Redzone  00000000288277cf: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.263163] Redzone  00000000cf53d836: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.272700] Redzone  00000000fd938235: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.282239] Redzone  00000000a7668edb: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.291776] Redzone  0000000006b2ed67: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.301057] Redzone  0000000042fdbd9e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.310593] Redzone  000000003dfe14ed: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.320144] Redzone  00000000d38dcb7d: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.329926] Redzone  00000000fb00a533: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  431.339471] Object   00000000a13449be: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.348743] Object   00000000e602edda: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 06 71 ff ff  kkkkkkkkkkkk.q..
[  431.358540] Object   00000000764736f6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.367818] Object   0000000084c1923c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.377615] Object   000000005a94cf2c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.386894] Object   00000000787c00b6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.396688] Object   000000009357e4c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.405969] Object   000000008b30a5a1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.415764] Object   00000000af7ea04a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.425302] Object   00000000bdc8268d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.434839] Object   00000000c1722b72: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.444118] Object   00000000427067dd: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.453912] Object   00000000e7a9561e: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.463451] Object   0000000099e2fed7: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.472731] Object   000000008ecaae7c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  431.482526] Object   000000008abe5b2b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[  431.492063] Redzone  000000004664bd27: bb bb bb bb bb bb bb bb                          ........
[  431.500813] Padding  000000003bbcfbf3: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.510350] Padding  00000000866e9d96: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.519888] Padding  00000000e0eb992e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.529426] Padding  000000009b40b6d5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.538706] Padding  00000000dabd263c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.548501] Padding  0000000080540612: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.558038] Padding  00000000d2f72e72: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.567664] Padding  00000000680509ab: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.576945] Padding  0000000072a63612: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.586739] Padding  00000000798dea41: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.596275] Padding  000000009f1a9b7c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.605556] Padding  000000005630a0ed: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.615093] Padding  0000000086f7291b: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.624888] Padding  00000000c51f5ec8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.634427] Padding  000000005c15de4e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  431.644319] FIX kmalloc-256: Restoring 0x00000000c01cc8b2-0x000000005db402b7=0x6b
[  431.644319] 
[  431.652988] FIX kmalloc-256: Marking all objects used

I then reflashed the dev kit to use the eMMC, and the issue also shows up after ~15 minutes.

[  906.691003] BUG kmalloc-256 (Tainted: G           O     ): Poison overwritten
[  906.691178] -----------------------------------------------------------------------------
[  906.691178] 
[  906.691523] Disabling lock debugging due to kernel taint
[  906.691655] INFO: 0x000000003b0f5a5e-0x0000000098f02217 @offset=9500. First byte 0x32 instead of 0x6b
[  906.691854] INFO: Slab 0x00000000b34624fa objects=21 used=21 fp=0x0000000000000000 flags=0x8000000000010200
[  906.692057] INFO: Object 0x00000000d69e77a5 @offset=9472 fp=0x000000005e895e23
[  906.692057] 
[  906.692286] Redzone  00000000b475ce4d: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.692484] Redzone  00000000c1364726: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.692692] Redzone  0000000012202fb1: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.692999] Redzone  00000000d64cb704: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.696404] Redzone  00000000e1410cc6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.705691] Redzone  000000004e301b16: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.715480] Redzone  000000004abe0904: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.724760] Redzone  0000000059f941f6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.734300] Redzone  00000000dc4f82d3: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.744092] Redzone  00000000319e70e2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.753631] Redzone  00000000caf36324: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.763167] Redzone  00000000aed89b37: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.772448] Redzone  0000000038185c10: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.781984] Redzone  0000000035d3aa3e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.791525] Redzone  00000000a1e8b3c4: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.801059] Redzone  000000005d259d1c: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[  906.810883] Object   00000000d69e77a5: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.820135] Object   00000000ab98a2c2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 32 1d ff ff  kkkkkkkkkkkk2...
[  906.829693] Object   000000007c155d0d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.839215] Object   000000000013e29a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.848775] Object   00000000c613ece2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.858287] Object   000000005502ecec: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.868079] Object   0000000057d2318d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.877361] Object   000000008b0327b3: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.886897] Object   00000000babb08bd: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.896691] Object   00000000ef487e48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.906228] Object   0000000079d49926: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.915788] Object   00000000a2238603: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.925306] Object   000000000a12a53b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.934842] Object   00000000f6e3cadc: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.944380] Object   00000000d2bc318d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[  906.953942] Object   000000003f53bc23: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[  906.963197] Redzone  000000006def9f81: bb bb bb bb bb bb bb bb                          ........
[  906.972205] Padding  00000000053b9ce6: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  906.981484] Padding  00000000928317e7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  906.991022] Padding  00000000d88a3e40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.000586] Padding  0000000067addccb: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.010096] Padding  00000000ac8bfc4f: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.019635] Padding  000000000b78c416: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.029450] Padding  0000000071d3186e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.039054] Padding  00000000eac98d35: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.048592] Padding  000000005f025c50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.058130] Padding  00000000374a98c5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.067668] Padding  0000000071da1e29: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.076946] Padding  00000000817f03d5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.086483] Padding  00000000dc701e47: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.096025] Padding  00000000bcbcafff: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.105581] Padding  00000000ae617ad9: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[  907.115417] FIX kmalloc-256: Restoring 0x000000003b0f5a5e-0x0000000098f02217=0x6b
[  907.115417] 
[  907.124121] FIX kmalloc-256: Marking all objects used
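The two reports above can be compared by subtracting the object's slab offset from the offset of the first corrupted byte. The sketch below does that arithmetic from the INFO lines; the regular expressions and the function name corruption_offset are my own and only cover the line format seen in these two dumps.

```python
import re

# Matches e.g. "INFO: 0x...-0x... @offset=7964. First byte 0x6 instead of 0x6b"
INFO_RE = re.compile(
    r"INFO: 0x[0-9a-f]+-0x[0-9a-f]+ @offset=(\d+)\. "
    r"First byte (0x[0-9a-f]+) instead of"
)
# Matches e.g. "INFO: Object 0x... @offset=7936 fp=0x..."
OBJECT_RE = re.compile(r"INFO: Object 0x[0-9a-f]+ @offset=(\d+)")


def corruption_offset(report: str):
    """Return (byte offset inside the object, first bad byte), or None."""
    info = INFO_RE.search(report)
    obj = OBJECT_RE.search(report)
    if info is None or obj is None:
        return None
    offset_in_object = int(info.group(1)) - int(obj.group(1))
    return offset_in_object, int(info.group(2), 16)
```

For the first report this gives 7964 - 7936 = 28, and for the second 9500 - 9472 = 28, which matches the Object dumps: in both, bytes 28 to 31 of the freed kmalloc-256 object are overwritten (06 71 ff ff and 32 1d ff ff). The same offset in both runs may point to the same stale field being written after free, but that is an inference from two samples, not a confirmed diagnosis.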

Are you not able to reproduce this issue yourselves? I am not doing anything beyond what the post above describes, and it took me ~1 hour end to end, including flashing and the whole setup.
I think you have a bug in the system which critically affects the serial port in this release and you should be able to reproduce it yourselves as it doesn’t take any custom software to trigger the issue.
This is regardless of serial port permissions, debug logs / counters you can add / not add to the tests, etc…

I don’t personally have an Xavier NX to work with. While it is true that the main issue is being able to reproduce the issue, one then has to find out where the specific line of code is which has gone bad. The reason I mentioned separating .write() and .encode() is to see more specifically what triggers this.

One other issue which makes this more difficult is that some of the log posts don’t seem to include the full message. The reason I say this is because the actual stack frame (which is printed before the "---[ end Kernel panic - not syncing: Oops: Fatal exception ]---" line) can either (A) show a consistent trigger at some particular part of the kernel, or (B) show that it isn’t one particular stack frame and thus an interaction between processes instead of a specific process running into this.

If someone at NVIDIA does reproduce this, but the trigger varies and is inconsistent, then it takes longer to figure it out. Consider for example that this appears to be related to stack memory (I don’t know for sure, but it looks that way). In that case one might see different stack frames which depend on the data. Or it might depend on whether debug symbols are present (which changes the content on the stack frame). It is a good idea to post logs of issues such that what ran just before the error and just after the error are also visible.

We reviewed the steps and ran the test again. It ran for 3 hours, but still no kernel panic occurred.

The test method was basically the same as the last time we tried.

The only other thing I can think might be different is that we have an NVMe drive installed in the dev kit. Do you also have that installed when you run the test?

We don’t connect an NVMe SSD to the Xavier NX devkit during the test.
Do you still hit the issue when you remove the NVMe? That would show whether the issue is related to the NVMe.

I am just curious, is any driver related to this in the form of a kernel module? If it is, then is the same driver present in “/lib/modules/$(uname -r)/” of both the rootfs and the initrd? The modules in the initrd could conceivably be different from those in /. So could the firmware (such as the device tree). If the module you expect to load differs from what actually gets loaded (for example because an initrd was not updated with the same module), it could cause some rather bizarre errors. Using an NVMe as rootfs tends to mean you had to run an initrd flash.
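One way to check this, sketched under the assumption that the initrd has already been unpacked into a directory, is to hash every .ko file under both module trees and list the mismatches. The function names and the example paths are hypothetical.

```python
import hashlib
from pathlib import Path


def module_digests(modules_dir):
    """Map each .ko file (path relative to modules_dir) to its SHA-256."""
    root = Path(modules_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*.ko")
        if p.is_file()
    }


def compare_module_trees(rootfs_dir, initrd_dir):
    """Report modules whose contents differ between rootfs and initrd."""
    rootfs = module_digests(rootfs_dir)
    initrd = module_digests(initrd_dir)
    shared = rootfs.keys() & initrd.keys()
    return {
        "differing": sorted(n for n in shared if rootfs[n] != initrd[n]),
        "only_in_initrd": sorted(initrd.keys() - rootfs.keys()),
    }
```

For example, compare_module_trees("/lib/modules/<release>", "/tmp/initrd/lib/modules/<release>") with <release> replaced by the output of uname -r. An empty "differing" list means every module present in both trees is byte-identical; modules that exist only in the rootfs are expected and are not reported.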
