Hi NVIDIA team,
Do you have any updates on this? Could you reproduce the issue or is there anything else I can help with?
Hi,
We tried the stress test that you provided on rel-35.6, but it does not seem easy to reproduce.
How long did it take, or how many times did you have to try, to reproduce this?
Hi WayneWWW,
Apologies for the late reply; we have been busy trying to get a release out with the compatibility serial drivers.
I have had a chance to replicate the test again a few times.
Electrical Connection
For these tests, I connect:
Build/Flash OS
This time I did not use a customised filesystem layout, and I used the nvsdkmanager_flash.sh tool to flash, rather than initrd_flash.sh.
I start with unpacking the Jetson Linux / sample rootfs, then run apply_binaries.sh:
I then set up the default user:
Then I put the board into recovery mode with the jumpers, and flash with nvsdkmanager_flash.sh:
Once flashing has completed, I log in via SSH and install a few packages (stress-ng, nano, python3-serial):
Next, I modify the extlinux.conf file to include “slub_debug=FZP” at the end of the command line, to make out-of-bounds writes easier to detect:
I then rebooted the system for the command line changes to take effect.
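For anyone reproducing this step, the edit can also be scripted. The sketch below is a minimal illustration, not the exact change used here: it assumes the usual extlinux.conf layout in which the kernel command line sits on lines starting with APPEND, and both the sample config text and the function name add_kernel_arg are made up for the example. In slub_debug=FZP, F enables sanity checks, Z enables red zoning and P enables poisoning.

```python
def add_kernel_arg(conf_text: str, arg: str = "slub_debug=FZP") -> str:
    """Return conf_text with `arg` appended to every APPEND line that lacks it."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Only touch kernel command lines, and only if the argument is missing.
        if fields and fields[0].upper() == "APPEND" and arg not in fields[1:]:
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical config fragment, for illustration only.
    sample = (
        "LABEL primary\n"
        "      LINUX /boot/Image\n"
        "      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait\n"
    )
    print(add_kernel_arg(sample), end="")
```

Running the function a second time leaves the text unchanged, so it is safe to re-run after a reflash. After rebooting, /proc/cmdline should show the added argument.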
I then create serial_stressor.py on the system:
I then start 2x serial port stressors and one stress-ng stressor. I also started capturing the serial debug messages from the UART RX/UART TX pins:
When a fault is reported (a kernel oops or a slub_debug fault), this is printed to the UART TX pin and shows as follows:
Slub debug detected out of bounds write:
Kernel oops:
Test results
The first test hit a kernel oops in 15 minutes of running stress-ng/2x serial port stressors.
The second test hit a slub_debug report 18 minutes after starting the stressors, then a kernel oops after 1 hour of stressors.
I hope this gives enough information to accurately reproduce the fault. It does take some time for the fault to develop, so perhaps it is worth leaving the system running for an hour or two to see the issue occur.
I would suggest showing the output of “lsmod” as well.
In the Python code you might get more detail if you separate the s.write() and encode() steps. Probably print something so you know the last part of the loop which worked. For example, maybe it is really an encode() error. Presumably it is a write() error, but you don’t really know without some debug output. A loop count in any debug output might be useful as well.
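As a rough sketch of that suggestion (this is not the original serial_stressor.py, which is not shown here), the loop could encode and write in two separate steps and log a loop count after each one. The names stress_write and FakePort are hypothetical; `port` is assumed to be any object with a write(bytes) method that returns the number of bytes written, as pyserial's serial.Serial does.

```python
import sys


class FakePort:
    """Stand-in for serial.Serial so the sketch runs without hardware."""

    def write(self, payload: bytes) -> int:
        return len(payload)


def stress_write(port, messages, log=sys.stderr):
    """Encode and write each message as separate steps, logging progress."""
    count = 0
    for count, text in enumerate(messages, start=1):
        try:
            payload = text.encode("utf-8")
        except UnicodeError as exc:
            print(f"loop {count}: encode() failed: {exc!r}", file=log)
            raise
        try:
            written = port.write(payload)
        except OSError as exc:  # pyserial's SerialException derives from OSError
            print(f"loop {count}: write() failed: {exc!r}", file=log)
            raise
        print(f"loop {count}: wrote {written} of {len(payload)} bytes", file=log)
    return count


if __name__ == "__main__":
    stress_write(FakePort(), ["hello\n", "world\n"])
```

The last "loop N" line before a fault shows which step completed most recently. It cannot show kernel-side corruption by itself, so it is best read alongside the kernel log.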
Hello,
I just managed to reproduce this by following the post above on the development kit.
It seems to break whether I flash the NVMe or the eMMC.
First failure within ~7 minutes of running the tools:
[ 431.219392] =============================================================================
[ 431.219640] BUG kmalloc-256 (Tainted: G O ): Poison overwritten
[ 431.219801] -----------------------------------------------------------------------------
[ 431.219801]
[ 431.220001] Disabling lock debugging due to kernel taint
[ 431.220132] INFO: 0x00000000c01cc8b2-0x000000005db402b7 @offset=7964. First byte 0x6 instead of 0x6b
[ 431.220324] INFO: Slab 0x00000000672aece3 objects=21 used=21 fp=0x0000000000000000 flags=0x8000000000010200
[ 431.220530] INFO: Object 0x00000000a13449be @offset=7936 fp=0x000000006ed1c27a
[ 431.220530]
[ 431.220717] Redzone 0000000014ad01b9: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.220917] Redzone 00000000204ddab8: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.221115] Redzone 00000000b82ab49b: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.221708] Redzone 000000003573d9be: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.224838] Redzone 000000003bb9bb43: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.234550] Redzone 00000000b726c626: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.243831] Redzone 00000000500b4578: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.253626] Redzone 00000000288277cf: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.263163] Redzone 00000000cf53d836: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.272700] Redzone 00000000fd938235: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.282239] Redzone 00000000a7668edb: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.291776] Redzone 0000000006b2ed67: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.301057] Redzone 0000000042fdbd9e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.310593] Redzone 000000003dfe14ed: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.320144] Redzone 00000000d38dcb7d: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.329926] Redzone 00000000fb00a533: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 431.339471] Object 00000000a13449be: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.348743] Object 00000000e602edda: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 06 71 ff ff kkkkkkkkkkkk.q..
[ 431.358540] Object 00000000764736f6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.367818] Object 0000000084c1923c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.377615] Object 000000005a94cf2c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.386894] Object 00000000787c00b6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.396688] Object 000000009357e4c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.405969] Object 000000008b30a5a1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.415764] Object 00000000af7ea04a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.425302] Object 00000000bdc8268d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.434839] Object 00000000c1722b72: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.444118] Object 00000000427067dd: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.453912] Object 00000000e7a9561e: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.463451] Object 0000000099e2fed7: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.472731] Object 000000008ecaae7c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 431.482526] Object 000000008abe5b2b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
[ 431.492063] Redzone 000000004664bd27: bb bb bb bb bb bb bb bb ........
[ 431.500813] Padding 000000003bbcfbf3: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.510350] Padding 00000000866e9d96: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.519888] Padding 00000000e0eb992e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.529426] Padding 000000009b40b6d5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.538706] Padding 00000000dabd263c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.548501] Padding 0000000080540612: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.558038] Padding 00000000d2f72e72: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.567664] Padding 00000000680509ab: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.576945] Padding 0000000072a63612: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.586739] Padding 00000000798dea41: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.596275] Padding 000000009f1a9b7c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.605556] Padding 000000005630a0ed: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.615093] Padding 0000000086f7291b: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.624888] Padding 00000000c51f5ec8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.634427] Padding 000000005c15de4e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 431.644319] FIX kmalloc-256: Restoring 0x00000000c01cc8b2-0x000000005db402b7=0x6b
[ 431.644319]
[ 431.652988] FIX kmalloc-256: Marking all objects used
I then reflashed the dev kit, and now using eMMC the issue also shows after ~15 minutes.
[ 906.691003] BUG kmalloc-256 (Tainted: G O ): Poison overwritten
[ 906.691178] -----------------------------------------------------------------------------
[ 906.691178]
[ 906.691523] Disabling lock debugging due to kernel taint
[ 906.691655] INFO: 0x000000003b0f5a5e-0x0000000098f02217 @offset=9500. First byte 0x32 instead of 0x6b
[ 906.691854] INFO: Slab 0x00000000b34624fa objects=21 used=21 fp=0x0000000000000000 flags=0x8000000000010200
[ 906.692057] INFO: Object 0x00000000d69e77a5 @offset=9472 fp=0x000000005e895e23
[ 906.692057]
[ 906.692286] Redzone 00000000b475ce4d: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.692484] Redzone 00000000c1364726: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.692692] Redzone 0000000012202fb1: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.692999] Redzone 00000000d64cb704: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.696404] Redzone 00000000e1410cc6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.705691] Redzone 000000004e301b16: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.715480] Redzone 000000004abe0904: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.724760] Redzone 0000000059f941f6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.734300] Redzone 00000000dc4f82d3: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.744092] Redzone 00000000319e70e2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.753631] Redzone 00000000caf36324: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.763167] Redzone 00000000aed89b37: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.772448] Redzone 0000000038185c10: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.781984] Redzone 0000000035d3aa3e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.791525] Redzone 00000000a1e8b3c4: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.801059] Redzone 000000005d259d1c: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 906.810883] Object 00000000d69e77a5: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.820135] Object 00000000ab98a2c2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 32 1d ff ff kkkkkkkkkkkk2...
[ 906.829693] Object 000000007c155d0d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.839215] Object 000000000013e29a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.848775] Object 00000000c613ece2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.858287] Object 000000005502ecec: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.868079] Object 0000000057d2318d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.877361] Object 000000008b0327b3: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.886897] Object 00000000babb08bd: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.896691] Object 00000000ef487e48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.906228] Object 0000000079d49926: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.915788] Object 00000000a2238603: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.925306] Object 000000000a12a53b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.934842] Object 00000000f6e3cadc: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.944380] Object 00000000d2bc318d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 906.953942] Object 000000003f53bc23: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
[ 906.963197] Redzone 000000006def9f81: bb bb bb bb bb bb bb bb ........
[ 906.972205] Padding 00000000053b9ce6: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 906.981484] Padding 00000000928317e7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 906.991022] Padding 00000000d88a3e40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.000586] Padding 0000000067addccb: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.010096] Padding 00000000ac8bfc4f: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.019635] Padding 000000000b78c416: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.029450] Padding 0000000071d3186e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.039054] Padding 00000000eac98d35: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.048592] Padding 000000005f025c50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.058130] Padding 00000000374a98c5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.067668] Padding 0000000071da1e29: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.076946] Padding 00000000817f03d5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.086483] Padding 00000000dc701e47: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.096025] Padding 00000000bcbcafff: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.105581] Padding 00000000ae617ad9: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 907.115417] FIX kmalloc-256: Restoring 0x000000003b0f5a5e-0x0000000098f02217=0x6b
[ 907.115417]
[ 907.124121] FIX kmalloc-256: Marking all objects used
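To compare reports like the two above, it can help to work out where inside the freed object the overwrite landed. The sketch below is only an illustration: it assumes the line format shown in these logs (two "@offset=" INFO lines followed by 16-byte "Object" hex rows), and the function name analyse_slub_report is made up. 0x6b and 0xa5 are the kernel's POISON_FREE and POISON_END values.

```python
import re

POISON_FREE = 0x6B  # fill byte written into freed objects
POISON_END = 0xA5   # marker in the last byte of a poisoned object

_OFFSET = re.compile(r"@offset=(\d+)")
# Up to 16 hex bytes per row; the ASCII column after them is ignored.
_OBJECT_ROW = re.compile(r"\bObject [0-9a-f]+: ((?:[0-9a-f]{2} ){1,16})")


def analyse_slub_report(lines):
    """Return (offset of first bad byte within the object, [(index, byte), ...])."""
    corrupt_at = object_at = None
    data = bytearray()
    for line in lines:
        if "INFO:" in line and "First byte" in line:
            corrupt_at = int(_OFFSET.search(line).group(1))
        elif "INFO: Object" in line:
            object_at = int(_OFFSET.search(line).group(1))
        else:
            row = _OBJECT_ROW.search(line)
            if row:
                data += bytes.fromhex(row.group(1))
    rel = None
    if corrupt_at is not None and object_at is not None:
        rel = corrupt_at - object_at
    last = len(data) - 1
    bad = [
        (i, b)
        for i, b in enumerate(data)
        if b != POISON_FREE and not (i == last and b == POISON_END)
    ]
    return rel, bad
```

Run against the two reports above, this gives an in-object offset of 28 in both cases (7964 - 7936 and 9500 - 9472), with four overwritten bytes each time: 06 71 ff ff and 32 1d ff ff. The matching offset may point to the same field of the same freed structure being written, though that is only a guess from two samples.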
Are you not able to reproduce this issue yourselves? I am not doing anything beyond what the post above describes, and it took me ~1 hour end to end, including flashing and the whole setup.
I think you have a bug in the system which critically affects the serial port in this release and you should be able to reproduce it yourselves as it doesn’t take any custom software to trigger the issue.
This is regardless of serial port permissions, debug logs / counters you can add / not add to the tests, etc…
I don’t personally have an Xavier NX to work with. While it is true that the main issue is being able to reproduce the issue, one then has to find out where the specific line of code is which has gone bad. The reason I mentioned separating .write() and .encode() is to see more specifically what triggers this.
One other issue which makes this more difficult is that some of the log posts don’t seem to include the full message. The reason I say this is that the actual stack frame (which occurs right after the “---[ end Kernel panic - not syncing: Oops: Fatal exception ]---”) can either (A) show a consistent trigger at some particular part of the kernel, or (B) show that it isn’t one particular stack frame, and thus an interaction between processes instead of a specific process running into this.
If someone at NVIDIA does reproduce this, but the trigger varies and is inconsistent, then it takes longer to figure it out. Consider for example that this appears to be related to stack memory (I don’t know for sure, but it looks that way). In that case one might see different stack frames which depend on the data. Or it might depend on whether debug symbols are present (which changes the content on the stack frame). It is a good idea to post logs of issues such that what ran just before the error and just after the error are also visible.
We reviewed the steps and ran the test again. It ran for 3 hours, but still no kernel panic occurred.
Basically, the test method was the same as the last time we tried.
The only other thing I can think might be different is that we have an NVMe drive installed in the dev kit. Do you also have that installed when you run the test?
We don’t have an NVMe SSD connected to the Xavier NX devkit during the test.
Could you remove the NVMe and check whether you still hit the issue, to see if it is related to the NVMe?
I am just curious, is any driver related to this in the form of a kernel module? If it is, then is the same driver present in “/lib/modules/$(uname -r)/” of both the rootfs and the initrd? The modules in the initrd could conceivably be different from those in /. So could the firmware (such as device tree). If the module you expect to load differs from what actually gets loaded (such as due to an initrd not updating with the same module) it could cause some rather bizarre errors. Using an NVMe as rootfs tends to mean you had to run an initrd flash.
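One way to check this is to hash the module files in both trees and list any that differ. The sketch below is a minimal illustration under stated assumptions: the initrd has already been unpacked into a directory, both directory arguments point at the respective /lib/modules/<version> trees, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def module_digests(root):
    """Map each module path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*.ko*")  # also matches compressed modules
        if p.is_file()
    }


def compare_module_trees(rootfs_dir, initrd_dir):
    """Report modules present in only one tree or differing between the two."""
    a, b = module_digests(rootfs_dir), module_digests(initrd_dir)
    return {
        "only_in_rootfs": sorted(a.keys() - b.keys()),
        "only_in_initrd": sorted(b.keys() - a.keys()),
        "differing": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

An initrd normally carries only a subset of modules, so a long "only_in_rootfs" list is expected. The "differing" list is the one worth looking at.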