As mentioned earlier, if an IRQ for the NVMe on PCIe is migrated to another core but the migration fails, or the migration is attempted during an atomic section of code, you could conceivably get the issues you are seeing (it isn't a guarantee). If a carrier board is involved, e.g., the PCIe slot of a third-party carrier board, then this could also conceivably be related to device tree (but only if that IRQ migration is valid to start with). However, if you get the same issue on both a third-party carrier board and the dev kit (you mentioned the AGX Xavier dev kits also have this issue), then it is not likely to be related to device tree (we know the device tree of the dev kit is correct). Migration across cores of an IRQ which should not migrate, or which can migrate but does so in an atomic section of code, would behave the same regardless of carrier board.
Is there any way you can try using (reflashing) one of the units without the NVMe? For example, maybe for testing you could flash one of the dev kits to a USB SATA drive. If the issue goes away when everything else is the same, except that there is no NVMe on PCIe whose IRQ could try to migrate, then you've narrowed it down.
The Advantech units flash to eMMC for boot. An Innodisk Industrial 2TB SSD holds app data as well as some container data. We flash their image and install the nvidia-jetpack metapackage afterwards.
The Tegra AGX Xavier dev kits also flash to eMMC for boot. A Samsung SM981 holds app data as well as container data.
We flash these with JetPack and install packages afterwards via the nvidia-jetpack metapackage.
We've also installed these once before without using the metapackage, instead choosing each package we required from JetPack individually.
Just a reminder: let's focus on rel-35.4.1/JetPack 5.1.2 and the devkit for now. Don't put too much effort into the old release or the Advantech board.
Please remove all peripherals from your devkit for the moment. Leave only the UART cable and the power cable, and boot up the device. If you have any PCIe devices, remove them for now.
Tell us whether the issue still occurs in this configuration.
If it does, run the same test scenario with other modules and see whether the issue is really specific to certain modules.
Also, do you need to run any application to hit this issue? For example, the first log indicates the issue was hit at timestamp "571143.022629", which means your device had been running for several days (roughly 6.6 days) before hitting it.
It does not seem to be an easy one to hit.
Unfortunately, our test will not run without a network connection, as it requires a Kubernetes cluster of 5 devices. Additionally, we need the NVMe drive or some other data drive attached to run; without that I don't think we would even be able to fit the application container images on the device. We could potentially move our data storage to a USB drive instead of the NVMe.
And yes, as I said, the application will generally run fine for hours before encountering this kernel panic.
You need to at least isolate first whether this issue is related to specific modules.
Confirm whether it is a hardware problem or a software problem so that we can proceed.
I don't know what your expectation is, but things won't get fixed if you cannot provide a method that reproduces this issue on the devkit.
We won't be able to just read a few logs and give you a patch that fixes it.
One of the issues with doing the test you requested is that I have previously been unable to replicate the issue on a single devkit without our full system running in a Kubernetes cluster of 5 devices. To confirm this, I have been running data through a single devkit with only a RabbitMQ instance and our model app running in Docker for the past week or so. I was pretty sure it wasn't going to crash, but today it finally did. Normally the crash occurs within 1-24 hours of running the full system test.
I had also modified this single-device test to more closely match the profile of how data flows into the model. I noticed that with the full system, test data was running through our models in pairs in quick succession, with about a 3-second delay between pairs. This resulted in a slower processing time for the first example of the two: you'd see one item take 130ms to run through the TensorFlow model, then the next take about 60ms, then a 3-second delay waiting for data, and the process repeats. So I replicated this behavior in my more minimal test. I'm not sure whether it is necessary to trigger the kernel panic.
I am having to focus on other tasks right now, but hopefully I can soon prepare a test that runs everything off the eMMC storage. I am also concerned about how long the issue takes to occur under minimal conditions. There is clearly a race condition or some other factor that adds an element of randomness to this, so if the issue doesn't occur after 1 week, I don't necessarily know that the problem has gone away; maybe I just didn't run the test long enough.
Just an update to keep this thread open. I have removed the NVMe from one of the dev kits we have, and to save space I am running our application without a Docker container. I have detached all peripherals except the UART cable and am running the application on startup using a cron job so I do not have to interact with the device. Based on my previous test, I expect a crash in this configuration could take at least a week.
I ran the test for 17 days with no crash. Since it didn't seem like it was going to crash at that point, I left the test running while I compiled a simple C program on the devkit to stress the CPU for random intervals of 0-10 seconds and then sleep for a random period of 0-10 seconds, thinking this would introduce some entropy and possibly force a crash. The program just generates two random floating-point numbers and multiplies them in a loop on 4 cores (which is the number of cores that are running).
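A rough sketch of the idea (not the exact program, just an approximation: four threads, each alternating a random 0-10 second burst of random floating-point multiplies with a random 0-10 second sleep; built with gcc -O2 -pthread):

/* cpu_stress.c - rough sketch of the random CPU stress loop described above */
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4  /* matches the number of cores that are online */

static void *worker(void *arg)
{
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(long)arg;
    volatile double sink = 0.0;  /* volatile keeps the multiplies from being optimized away */

    for (;;) {
        /* Busy loop of random floating-point multiplies for 0-10 seconds. */
        double burst = (rand_r(&seed) % 10000) / 1000.0;
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            double a = (double)rand_r(&seed) / RAND_MAX;
            double b = (double)rand_r(&seed) / RAND_MAX;
            sink += a * b;
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) +
                 (now.tv_nsec - start.tv_nsec) / 1e9 < burst);

        /* Sleep for a random 0-10 second interval. */
        int ms = rand_r(&seed) % 10000;
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}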
After running this C program alongside the existing test, it crashed after approximately 4 hours.
I have attached the associated log; the devkit was connected only to power and the UART cable. I would occasionally log in over UART to check the uptime, and I also used the UART connection to enter (via vim) and compile the CPU stress program (so the output there looks somewhat garbled). edge_test.log (338.5 KB)
Could you share the test method you are using so that we can check it on our side directly?
First, JetPack released rel-35.5 within the last two months. It is pointless to keep testing on rel-35.4.1.
Second, I don't see any point in testing this further on your side. If you can confirm that you reproduce the issue on the devkit, then share a step-by-step method to reproduce it with us.
There is nothing else you need to do on your side beyond the above. Thanks.
I will try to get you something to reproduce the error. I cannot immediately share the code that runs the model without approval.
Also, I just rechecked the log and apparently missed the fact that it had crashed 4 hours after starting the test 17 days ago. So the CPU stress program is apparently not necessary, but it may help cause the panic when the test is otherwise running stably.
It seems that the behavior is highly random, so I cannot say how long it will take to cause a crash. I am hoping that the additional stress utility will make it more likely to occur. I am rerunning overnight with the hope that it will crash before morning.
What occurred over the course of my test is as follows:
1. I began running the test.
2. 4 hours later there was a kernel panic and crash that I did not notice at the time.
3. The test continued running without a kernel panic for 17 days, at which point I added the random CPU stress program without interrupting or restarting the test.
4. 4 hours after adding the stress program there was another kernel panic.
Are you saying that the kernel panic in (2) did not lead to a system reboot?
It did lead to a system reboot. This can be seen in the log at line 1438:
[15411.675756] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
I did not notice because I was away from the machine at the time, and I miscalculated the expected uptime when I checked later. I set the test to run automatically on reboot, so after it restarted 17 days ago a new test began.
I did notice this in one of the earlier failure logs (el1): [1536360.732866] handle_mm_fault+0xa3c/0x1020
I'm curious whether you can monitor memory usage. Maybe it ran out of memory. This isn't a particularly sophisticated way of doing it, but if you are on a serial console, then you could just background "free" on a timer and then monitor "dmesg --follow". Example:
watch -n 10 free &
dmesg --follow
If the magic sysrq is enabled, then you could just use that every so often, similar to free. This is cleaner since the output goes into the dmesg log itself. Try this to see if it is enabled on the Jetson:
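For example (assuming you have sudo on that console; writing "m" to the sysrq trigger asks the kernel to dump memory statistics into the kernel log):
sudo sh -c 'echo m > /proc/sysrq-trigger'
dmesg | tail -n 30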
Did memory stats show up? Or did you get a message about that sysrq function not being allowed? If it isn't allowed, that can be changed, but by default Jetsons seem to have sysrq enabled. If it works, then you could do something like this to log memory once every 30 seconds (that's probably a lot over 17 days):
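A rough sketch (just the same sysrq trigger in a loop; if sudo prompts for a password when backgrounded, run it from a root shell instead):
while sleep 30; do sudo sh -c 'echo m > /proc/sysrq-trigger'; done &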
If memory is involved, it is just easier to be certain you're not running out in some context that doesn't allow an OOM kill. This can be added alongside any existing "dmesg --follow" on the serial console without interfering. That, plus simply updating to R35.5 first, is probably the fastest route to ruling out other issues.