Since this device has probably accumulated more runtime than any of our other devices, and our application writes a lot of logs, I suspected the eMMC might be worn out. So I checked with mmc extcsd read /dev/mmcblk0. Here’s the output:
=============================================
Extended CSD rev 1.8 (MMC 5.1)
=============================================
... removed for brevity ...
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
According to this document, this should mean that 0-10% of the eMMC's estimated lifetime has been used. But this of course requires that the SoM actually reports correct values. It could report dummy values, after all.
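For reference, the fields can be decoded mechanically. The sketch below parses the mmc extcsd read output shown above and maps each value to its band, assuming the encoding from the JEDEC eMMC 5.x EXT_CSD definition (0x01 = 0-10%, up to 0x0A = 90-100%, 0x0B = exceeded). The function names are made up for illustration, and whether the SoM reports meaningful values is a separate question.

```python
import re

# Matches lines such as "... [EXT_CSD_PRE_EOL_INFO]: 0x01" in mmc-utils output.
FIELD_RE = re.compile(r"\[(EXT_CSD_[A-Z_]+)\]:\s*(0x[0-9a-fA-F]+)")

# PRE_EOL_INFO states, assuming the JEDEC eMMC 5.x encoding.
PRE_EOL_STATES = {
    0x00: "not defined",
    0x01: "normal",
    0x02: "warning (about 80% of reserved blocks consumed)",
    0x03: "urgent",
}


def parse_extcsd(text):
    """Return {field_name: integer_value} for every bracketed field found."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def decode_life_time(value):
    """Map a DEVICE_LIFE_TIME_EST_TYP_A/B value to its 10% wear band."""
    if value == 0x00:
        return "not defined"
    if 0x01 <= value <= 0x0A:
        return f"{(value - 1) * 10}-{value * 10}% of estimated life time used"
    if value == 0x0B:
        return "exceeded maximum estimated life time"
    return "reserved"


def decode_pre_eol(value):
    """Map a PRE_EOL_INFO value to its state name."""
    return PRE_EOL_STATES.get(value, "reserved")


SAMPLE = """\
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""

for name, value in parse_extcsd(SAMPLE).items():
    if "LIFE_TIME_EST" in name:
        print(name, "->", decode_life_time(value))
    elif name.endswith("PRE_EOL_INFO"):
        print(name, "->", decode_pre_eol(value))
```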
Hence, my questions are:
Is the Jetson TX2 reporting correct values to the mmc command?
Can I reliably assume the eMMC is NOT worn out if the values reported by the mmc command are below 0x09, that is, below the 80-90% band?
According to this article, TYP_A refers to the SLC and TYP_B to the MLC blocks of the eMMC. That both values differ from 0x00 suggests that the eMMC uses both memory technologies. Is that true? I was not aware this is even possible; I thought an eMMC is either one or the other.
Does the eMMC do wear-leveling? As said, our application logs a lot, and the file it logs to is at a specific place. When we write a lot into this file, are only this file's blocks worn out, or is the wear spread over the whole eMMC?
Does the eMMC support the TRIM command?
Does the eMMC also do wear-leveling if it is not informed about unused blocks via the TRIM command?
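On the TRIM question, one thing that can be checked from user space is whether the kernel advertises discard support for the block device. A minimal sketch, assuming the standard block-layer sysfs attribute queue/discard_max_bytes is present on this kernel (a value of 0 means discard is not available); the helper names are hypothetical. This only shows what the kernel exposes, not what the eMMC firmware does internally.

```python
from pathlib import Path


def discard_enabled(raw):
    """Interpret the content of queue/discard_max_bytes; 0 means no discard."""
    try:
        return int(raw.strip()) > 0
    except ValueError:
        return False


def device_supports_discard(device, sys_block=Path("/sys/block")):
    """Check whether the kernel advertises discard for e.g. 'mmcblk0'."""
    attr = Path(sys_block) / device / "queue" / "discard_max_bytes"
    try:
        return discard_enabled(attr.read_text())
    except OSError:
        # Attribute missing or unreadable: treat as "not supported".
        return False


print("mmcblk0 discard support:", device_supports_discard("mmcblk0"))
```

If this prints True, running fstrim on the mounted filesystem (or mounting with the discard option) should pass unused ranges down to the device.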
Context information:
Our device is running a self-written GNU/Linux distribution created using the Yocto project and meta-tegra with L4T 32.4.3.
I can’t answer, but this combination tends to suggest this is a software problem, and not hardware (though possibly influenced by slow hardware response):
Scheduling while atomic implies something trying to run and preempt atomic code sections, and the nature of atomic is to not allow this. More likely it is a programming error in some rare corner case. There is a possibility that such a corner case only occurs when the hardware responds slower than usual, and statistically speaking, the longer it runs the more likely it’ll hit one of the slowest response cases. Don’t know, and it could be hardware, but hardware really shouldn’t have any control over an attempt to preempt an atomic code block.
I do see the overlay driver is installed. Is there any overlayfs being used? I ask because not many people use this, and perhaps if there were such a rare corner case, then it would be useful to know if something new is present.
First of all, thanks for the input. That really is helpful.
The developer colleague who reported this said the problem only occurred when the ethernet cable was not connected. Our device also has WiFi connectivity, hence a disconnected ethernet cable does cause a different code path inside the WiFi driver to be taken. We couldn’t observe this problem on any of our other devices, though.
I’m no kernel dev, but kernel development is something I am interested in. If I wanted to dig deeper, is there something to read or learn that you can recommend? How would one tackle this? Perhaps my company will give me some time to work on this.
This won’t be in any specific order, and not really arranged well, but some things to research and learn follow…
Inside the kernel, drivers and other software use hardware addresses, that is, addresses on a bus which actually talk to devices. User space (outside of the kernel) instead has virtual addresses assigned. These addresses are managed by the memory controller (the MMU) and are translated back and forth whenever a physical address is needed. The first thing to note in the bug is that it reports an inability to write to a virtual address. So it wasn’t a driver-to-hardware access failing; it was a misuse of an address which the memory controller does not believe to be valid (or which was valid but somehow locked out from responding).
Interrupts are either “hardware” interrupts (using a wire to trigger a hardware bus), or else “software” interrupts. Inside the kernel, whenever a driver for some hardware is called, it is triggered via an ordinary “interrupt”, or hardware interrupt. There are a number of “sort of” drivers which provide functions not requiring access to hardware, e.g., perhaps there is some function implemented to reply with the content of a memory location which has no need to access some add-on hardware. These are also invoked via an interrupt, but this is a “software” interrupt, and it is raised purely by software logic without any actual wire going high or low. The kernel has a sort of “daemon” to manage soft interrupts, and this is “ksoftirqd” (you could call this a scheduler since it operates on time slices instead of wires going high or low). Having an error detected by ksoftirqd (a softirq) implies there was a reference to kernel code which is not directly tied to hardware.
All software running in the kernel is competing with other software for time to run. The mechanism for doing so is a “scheduler”. As you learned above, the scheduler basically deals with interrupts and will decide whether to save the state of something now running and boot it out momentarily for some other process, or whether to let the current activity continue while making the interrupt wait. Some code mandates that it must run to completion, and this is atomic code. Your bug triggered an attempt to schedule (make interruptible) a section of code which must not be interrupted. This is a violation, and interrupting such code could cause any number of errors, so it is fatal to whatever tried to interrupt. Usually this sort of error is a software programming error. Trying to synchronize threads in user space is difficult enough, and trying to do so in kernel space is more difficult. Likely you hit some code in a corner case, and this attempt to preempt atomic code is rare.
I asked about the overlayfs being linked in because a number of people have had issues getting this running correctly. Many will have made modifications to the kernel itself, and could add bugs if overlayfs required patching. Additionally, not many people are running this, so if there is a corner case, then I’d consider this to be a good starting point if the other systems don’t have this issue.
Note that because interrupts and scheduling are essentially a way to get multiple things working together, and because everything in the kernel can access any part of the kernel, a programming bug is more likely to make seemingly unrelated pieces of code interfere with one another. An example might be the timing of an interrupt which simply differs as to when other drivers are told to stop and wait while others are serviced.
Note that the memory controller knows which user space process is allowed to access which memory, and that sometimes there is “shared” memory. So normally memory would start off in user space as accessible only by itself, but could map in memory which is available for some other process to also read or write. If the memory controller sees an attempt to read or write memory which is not allowed, then there is an exception. Quite often that exception is from user space code trying to use uninitialized memory, or memory which has previously been released. It is also possible that something like shared memory would have an exception if other memory controller activity has not yet released it from within an atomic code section (to prevent corruption of two simultaneous accesses to the same physical memory which the memory controller is mapping). Note that the “virtual” address two different processes might see to access the same shared memory would still be the same physical address, but only the memory controller would know that.
I would guess that if you wanted to study this then you might start by looking at the “interrupt vector table” which starts all physical hardware drivers. When the kernel first loads, and nothing has yet run, and the kernel has just been copied into memory, a first interrupt is triggered. This is where it all begins.
You’d also want to understand hardware interrupts leading to this table of interrupts, and that addresses referred to in the kernel are actual physical memory bus addresses. Thus a kernel can use assembler branch instructions. The biggest example of this is kernel modules, which are loaded into memory at a physical address below that of where the kernel loads, and any use causes “direct branch” instructions to simply redirect the running point to somewhere that is a module’s physical address.
There are actually a lot of books and tutorials out there. If you wanted to learn “practical” code then I would suggest you start with tutorials on writing kernel modules. These are the least risky to experiment with, the most convenient to experiment on, and let you get an idea of kernel programming without needing to know a lot about interrupt tables. You might find a tutorial on modules which introduces concepts of atomic versus non-atomic code.
Incidentally, since hardware interrupts require an actual physical wire to trigger, only a CPU core which the wire can reach can be used for that driver. All CPU cores can execute software interrupts, since a software interrupt is just something that starts in memory and does not talk directly to actual hardware. If you have too much hardware IRQ activity and the available core for that hardware cannot service the interrupt in time, then it is called “interrupt starvation”. Thus it is best to have the smallest possible piece of code run during a hardware IRQ, and then to hand off any other function not directly bound to the hardware to a soft IRQ. I mention this because, although this is not the bug you ran into, there is a resemblance between code blocked from running because there simply are not enough resources, versus code blocked from running due to a bug trying to access something in an atomic code section. It is quite possible to see an atomic code section in both hardware and software drivers.
Mind = blown. Well, thank you again! Knowing about these basics will kickstart my attempts to work on the Linux kernel. Very much appreciated!
Considering what you said, especially that the problem could be in some unrelated part of the kernel, I now assume the WiFi kernel module to be the culprit. It has been causing problems for months. While booting our Jetson TX2-based device, it already restarts three times. It also restarts every time the device leaves the coverage of a WiFi network and needs to roam to another one.
A test case would probably be to unload that kernel module and see if that gets rid of the kernel panics. Testing is just a bit hard, because these kernel panics are very sporadic. That is, sometimes they appear constantly, and then not at all for quite some time.
@ nvidia support:
This conversation has diverged a lot from the initial topic, but I would still ask you to answer the questions in the initial post of this thread. I couldn’t find details on the eMMC wear-leveling anywhere and am still interested.
This is quite possible. Two different software sections need to interact before one can try to illegally change context within an atomic code section. The question would be whether WiFi triggers a bug in the other driver, or whether the other driver is reacting to something the WiFi has done illegally related to forcing preemption in other code’s atomic section.
Unfortunately I cannot answer the wear leveling question. All I can say is that the bug is unlikely to be hardware-related.
There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
Hi,
The sysfs nodes that expose the EOL and device life estimation report correct values. Reading these sysfs nodes sends CMD8 to the device to fetch the latest EOL/device life info.
Wear leveling is in the scope of the eMMC device's internal firmware, not in the scope of the SDMMC driver. The device's internal firmware ensures that the cells are worn evenly across the device, even with multiple writes to the same file.
Based on the device life estimation and EOL data, the device is not worn out. The cause of the error could be something else. Could you try to dump the full error log instead of a partial one?
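To illustrate the idea only: the toy model below compares repeated writes to one logical block with and without remapping. It is not the algorithm of any real eMMC firmware (those are vendor-specific); the class and its "pick the least-worn free block" policy are assumptions made for the example.

```python
class ToyFlash:
    """Toy flash translation layer mapping logical to physical blocks."""

    def __init__(self, physical_blocks, wear_leveling):
        self.erase_counts = [0] * physical_blocks
        self.mapping = {}  # logical block -> physical block
        self.wear_leveling = wear_leveling

    def write(self, logical):
        if self.wear_leveling:
            # Blocks holding other logical data are busy; the block that
            # currently backs this logical block becomes free on rewrite.
            busy = set(self.mapping.values()) - {self.mapping.get(logical)}
            free = [b for b in range(len(self.erase_counts)) if b not in busy]
            physical = min(free, key=lambda b: self.erase_counts[b])
        else:
            # No remapping: a logical block always hits the same cells.
            physical = logical
        self.erase_counts[physical] += 1
        self.mapping[logical] = physical


naive = ToyFlash(physical_blocks=10, wear_leveling=False)
leveled = ToyFlash(physical_blocks=10, wear_leveling=True)
for _ in range(1000):
    # Model a log file that keeps rewriting the same logical block.
    naive.write(0)
    leveled.write(0)

print("without wear leveling:", naive.erase_counts)
print("with wear leveling:   ", leveled.erase_counts)
```

In this model the unleveled device puts all 1000 erases on one block, while the leveled one spreads them across all ten, which is the behaviour described above for writes to a single file.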
Thanks for the feedback. Unfortunately I have no complete error log at hand. The issue hasn't reappeared since, but I also have to admit I didn’t try too hard. Other tasks came up in the meantime which are considered of higher importance. If this receives attention again, I’ll open a new thread as you suggested. As we can rule out the eMMC now, separating it from this thread is probably a good thing to do anyway.