Does Nvidia use any kernel static and runtime analysis tools (semi-rant, plus findings on issues on ...

I’ve been debugging some of my own driver additions to the 4.4 kernel release with the 28.1 L4T driver package (I’m using a CTI Spacely carrier board, which only officially supports the 28.1 kernel. I tried booting the 28.2 kernel with the CTI patches, but it was having early boot problems and CTI disabled the serial output during early boot so I wasn’t able to debug that [and they don’t know or won’t tell how to enable it, something about TX1 and TX2 compatibility]). While debugging, I enabled KASAN to make sure my driver isn’t doing weird stuff with memory, but what I discovered is a large quantity of problems from Nvidia specific kernel code, things like use-after-frees and accessing global variables and overrunning them, etc. Looking into these problems (which didn’t take that long, even when considering that this is my first time using KASAN and debugging the kernel in this fashion), I discovered errors. For example, some of the use-after-frees were happening in platform drivers that were deferred but some driver along the way had defined its of_device_id table as __initconst, so it was freed by the time the probe was retried. The global variable overrun was happening due to a lack of a terminating {} sentinel in an of_device_id struct array in a driver written by Nvidia (drivers/crypto/tegra-hv-vse.c specifically). Funny thing, this last issue in tegra-hv-vse.c is a problem that coccinelle caught. After learning that coccinelle can catch a whole set of issues within the kernel (with some false positives), I decided to run it, and there were a lot of issues scattered about (I’m still analyzing them).

Does anyone at Nvidia actually bother with debugging their work? There are so many tools available for debugging the kernel (KASAN, coccinelle, sparse, and others, see https://www.kernel.org/doc/html/v4.15/dev-tools/index.html for a longer list, more modern list or kernel-4.4/Documentation/development-process/4.Coding for something already in the kernel distributed by Nvidia).

Since I don’t like to complain without offering some solution/help, is there a process to submit patches to Nvidia? I’d have to go through my company’s open source submission process (heaven knows how long that’s going to take), but it would be nice to have code like this be fixed for everyone, instead of having every developer, like me, outside of Nvidia re-doing the same work.

Alright, I may have been a bit sour after dealing with these errors.

In Nvidia’s defense, it looks like a lot of the KASAN errors I was seeing came from either the __initconst stuff I mentioned earlier, or two struct arrays missing a sentinel {} (one of_device_id related, the other was something about PCI). Taking care of these issues made all of the KASAN errors go away.

Coccinelle brought up a couple of more problems within the Nvidia code (using kfree to free devm allocated memory, for example), but most of the output seemed to just be suggestions (such as that setting .owner for the driver struct was necessary for a whole bunch of cases since platform driver initialization sets it up, and that memdup_user could be used to get rid of some boilerplate code to copy data coming from userland). That was the most noteworthy of the messages pertaining to the Nvidia code. There were a lot of warnings (and some errors, iounmap related) also emitted for some of the mainline code by Coccinelle, so there’s that also.

Hi,

Thanks for your post. Jut for the record, KASAN is regularly run on NVIDIA code and appropriate alarms are fixed. If you want to contribute to L4T Jetson code, you can either send your patches through the forum or through Nvidia Customer relationship channel.

If you are looking for any particular help regarding Jetpack release and kernel, please post your query here.

thanks
Bibek

Thanks for the reply.

Yeah, I acknowledge I may have been angry when I wrote my first post. Only two of the KASAN errors I encountered I’m a little surprised Nvidia didn’t catch (the missing {} sentinel at the end of structure arrays). The __initconst stuff I’m pretty sure I triggered because normally none of the Nvidia drivers support EPROBE_DEFER to delay probing, so in theory everything has initialized by the time the kernel frees up its memory. I had to add deferred probing support because this board (CTI Spacely) for some reason takes a long time initializing its i2c slave devices (actually, they have a hard-coded 15 second delay for their pca593x driver! I replaced that with EPROBE_DEFER). For more information on this second problem, refer to this thread on LKML: https://lkml.org/lkml/2017/12/3/102 (doesn’t seem like much came out of that discussion other than “yes, that’s a problem”).

As an aside, if you haven’t already, I’d recommend finding a way to run Coccinelle on the kernel at least once a month during development. Running all of the tests takes a long time (a few days?) and while there are false positives, there are also actual issues it catches.

Once I get these patches organized and working, I’ll probably send them to Nvidia, that way we don’t duplicate work. Thanks!

Be aware that errors on “__initconst” are not an error with earlier compilers…the compilers originally used (4.8 series) are very old and mostly won’t care about this for many cases. Newer compilers do.

I’m using my own cross compiler, GCC 7.3.0. the __initconst errors I’m seeing are at runtime, more specifically, they are use-after-free errors being detected by KASAN, which are legitimate if one is using drivers that had their initialization deferred-- since the re-probe could then occur after the kernel has freed init code and data.

I was hoping to use deferred initialization to fix some driver dependencies woes with the CTI Spacely board, but at this point I’ve given up on that effort. There seems to be too much that assumes there isn’t deferred initialization of drivers (I ran into problems between our custom IMX driver (deferred) initializing after the tegra-vi4 driver, which seemed to cause no /dev/video0 being created).