Possible bug in CryptoAPI due to tegra_xxx functions

Running MPPE over a PPP link (using MS-CHAP, not V2) triggers a kernel BUG and panic, probably because some tegra_xxx functions are not interrupt-safe, as the kernel expects them to be.

This is on BSP 32.5.1 (kernel 4.9.201-tegra) on a Jetson AGX Xavier devkit.

Could any NVIDIA developer check this and confirm whether there is a simple patch to avoid the problem?

Regards

This is the bug:

[  114.224871] kernel BUG at ../mm/vmalloc.c:1390!
[  114.224997] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
...
[  114.226190] Hardware name: Jetson-AGX (DT)
[  114.226273] task: ffffffc795404600 task.stack: ffffffc7cff78000
[  114.226400] PC is at __get_vm_area_node.isra.10+0x178/0x190
[  114.226509] LR is at get_vm_area_caller+0x54/0x68
[  114.226600] pc : [<ffffff8008212a58>] lr : [<ffffff8008212cbc>] pstate: 00400145
[  114.227031] sp : ffffffc7cff7b590
[  114.227306] x29: ffffffc7cff7b590 x28: ffffffc7da0fb018 
[  114.229984] x27: 0000000000000007 x26: ffffffbebfff0000 
[  114.235568] x25: ffffff8008000000 x24: 0000000000000001 
[  114.240820] x23: ffffff8008bffd9c x22: 0000000000000008 
[  114.245891] x21: 00000000024000c0 x20: 0000000000000008 
[  114.250965] x19: 0000000000001000 x18: 0000007fcb8092a4 
[  114.256739] x17: 0000007f7a5de960 x16: ffffff800825d3f8 
[  114.262201] x15: 000000000079baf5 x14: 0000000000000028 
[  114.268114] x13: ffffffc7cd692e80 x12: 0000000000000001 
[  114.273473] x11: ffffff8009531000 x10: ffffffbf00000000 
[  114.279494] x9 : 00000000fff7f000 x8 : ffffffc7d1c46000 
[  114.285264] x7 : 0000000000851c46 x6 : ffffff8008bffd9c 
[  114.290776] x5 : 00000000024000c0 x4 : ffffffbebfff0000 
[  114.296113] x3 : ffffff8008000000 x2 : 0000000000000008 
[  114.301454] x1 : 0000000000000001 x0 : 0000000000000401 
[  114.306543] 
[  114.308199] Process pppd (pid: 8163, stack limit = 0xffffffc7cff78000)
[  114.314061] Call trace:
[  114.316699] [<ffffff8008212a58>] __get_vm_area_node.isra.10+0x178/0x190
[  114.322815] [<ffffff8008212cbc>] get_vm_area_caller+0x54/0x68
[  114.328160] [<ffffff800879e4f8>] dma_common_pages_remap+0x40/0x90
[  114.334019] [<ffffff80080a10b0>] __iommu_alloc_attrs+0xd8/0x478
[  114.339879] [<ffffff8008bffd9c>] tegra_se_sha_process_buf+0x5ec/0x848
[  114.345821] [<ffffff8008c00104>] tegra_se_sha_op+0x10c/0x1e0
[  114.350903] [<ffffff8008c00234>] tegra_se_sha_digest+0x5c/0x98
[  114.356515] [<ffffff8008404a08>] crypto_ahash_op+0x40/0xa8
[  114.361579] [<ffffff8008404b10>] crypto_ahash_digest+0x30/0x48
[  114.367180] [<ffffff80089eb0cc>] get_new_key_from_sha+0x11c/0x148
[  114.373384] [<ffffff80089eb154>] mppe_rekey+0x5c/0x190
[  114.378634] [<ffffff80089eba58>] mppe_init.part.1+0xd8/0x228
[  114.383977] [<ffffff80089ebcc0>] mppe_comp_init+0x88/0x90
[  114.389318] [<ffffff80089e3b00>] ppp_ccp_peek+0x188/0x240
[  114.394828] [<ffffff80089e4460>] __ppp_xmit_process+0xe8/0x550
[  114.400947] [<ffffff80089e4d48>] ppp_xmit_process+0x50/0xb8
[  114.406634] [<ffffff80089e65fc>] ppp_write+0x11c/0x158
[  114.411798] [<ffffff800825add8>] __vfs_write+0x48/0x118
[  114.416703] [<ffffff800825bdcc>] vfs_write+0xac/0x1b0
[  114.422035] [<ffffff800825d454>] SyS_write+0x5c/0xc8
[  114.427024] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[  114.432292] ---[ end trace 5d5f03683597ef6b ]---
[  114.453003] Kernel panic - not syncing: Fatal exception in interrupt
[  114.453146] SMP: stopping secondary CPUs
[  114.453239] Kernel Offset: disabled
[  114.453313] Memory Limit: none
[  114.454055] trusty-log panic notifier - trusty version Built: 08:40:57 Feb 19 2021
[  114.477398] Rebooting in 5 seconds..

hello david.fernandez,

I don’t have experience with MPPE. However, l4t-r32.5.1 is quite an old L4T release. Could you please try moving to the latest r32 release, i.e. L4T R32.7.2, for confirmation?
thanks

Right, MPPE is just link encryption for PPP (it works like a compression protocol under CCP), and PPP works like a line discipline for a TTY device, be that a modem, a serial device, or anything of the like.

If you look at the stack trace (at the end of the panic dump), the problem arises because the call chain originates in ppp_write.

From Documentation/serial/tty.txt:

write() - Called to write bytes to the device. May not
sleep. May occur in parallel in special cases.
Because this includes panic paths drivers generally
shouldn’t try and do clever locking here.

This shows that the TTY write path must not sleep, mainly because it acquires IRQ-level locks.

And from include/linux/tty_ldisc.h:

 * ssize_t (*write)(struct tty_struct * tty, struct file * file,
 *                  const unsigned char * buf, size_t nr);
 *
 * This function is called when the user requests to write to the
 * tty. The line discipline will deliver the characters to the
 * low-level tty device for transmission, optionally performing
 * some processing on the characters first. If this function is
 * not defined, the user will receive an EIO error.

This just confirms that ppp_write sits in the middle of the path to the TTY write.

What happens here is that, after the MS-CHAP authentication, MPPE is ready to derive its encryption key from the credentials and initialize its cipher.

So it does that as part of transferring its first data frame on the link.

BUT… the crypto_ahash_digest call that computes a SHA-1 hash goes through tegra_se_sha_digest, which I take to be an NVIDIA optimization that offloads some cryptographic operations to the Jetson hardware. Those functions end up trying to allocate memory in a way that can sleep, and the BUG check in vmalloc verifies that it is not being called from interrupt context.
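
To make the failing check more concrete, here is a minimal userspace sketch (not kernel code) of how a preempt_count-style counter makes an in_interrupt()-style test come out true as soon as bottom halves are disabled, even when the caller is an ordinary process such as pppd. It rests on two assumptions that the trace does not show directly: that the PPP transmit path runs with bottom halves disabled, and that the bit layout matches the 4.9-era include/linux/preempt.h. All model_* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout of preempt_count (4.9-era include/linux/preempt.h). */
#define SOFTIRQ_SHIFT          8
#define HARDIRQ_SHIFT          16
#define NMI_SHIFT              20

#define SOFTIRQ_MASK           (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK           (0xfu << HARDIRQ_SHIFT)
#define NMI_MASK               (0x1u << NMI_SHIFT)

#define SOFTIRQ_OFFSET         (1u << SOFTIRQ_SHIFT)
/* Disabling bottom halves adds twice the softirq offset to the counter. */
#define SOFTIRQ_DISABLE_OFFSET (2u * SOFTIRQ_OFFSET)

/* Hypothetical stand-in for the per-task preempt counter. */
static unsigned int model_preempt_count;

/* Models local_bh_disable(), which spin_lock_bh() also implies. */
void model_local_bh_disable(void)
{
    model_preempt_count += SOFTIRQ_DISABLE_OFFSET;
}

/* Models local_bh_enable(); must pair with a previous disable. */
void model_local_bh_enable(void)
{
    assert(model_preempt_count >= SOFTIRQ_DISABLE_OFFSET);
    model_preempt_count -= SOFTIRQ_DISABLE_OFFSET;
}

/* Models in_interrupt(): true if any hardirq, softirq or NMI bit is set. */
bool model_in_interrupt(void)
{
    return (model_preempt_count &
            (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK)) != 0;
}

/* Models the vmalloc-area precondition that the BUG check enforces. */
bool model_may_get_vm_area(void)
{
    return !model_in_interrupt();
}
```

In this model, calling model_local_bh_disable() and then model_may_get_vm_area() returns false, which is the situation the BUG check reports. It would also be consistent with the panic line saying "Fatal exception in interrupt" although the faulting process is pppd inside a write() system call.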

Regarding our version of L4T, unfortunately those Jetsons are in a satellite and we have no way to reflash them… I tried every way I could think of to run the flash from one Jetson to another, but it seems that some of the flash utility binaries are Intel 32-bit executables with no ARM versions, and QEMU is not in good shape when running an Intel guest on an ARM host, so I am stuck with that version for a while…

I’ll see if I can try the latest version you mentioned on my dev kit to check whether it fixes this, but I wonder whether, with the information I have provided, you could check if the same call path still tries to allocate memory in the same way, and whether there is an easy way to patch it… I could try a patch on the running kernel and see.

Regards
David

hello david.fernandez,

Could you please modify the tegra_se driver,
i.e. $TOP/public_sources/Linux_for_Tegra/source/public/kernel/nvidia/drivers/crypto/tegra-se-nvhost.c
Please try changing dma_alloc_attrs/dma_free_attrs to dma_alloc_coherent/dma_free_coherent.
For example,

--- a/drivers/crypto/tegra-se-nvhost.c
+++ b/drivers/crypto/tegra-se-nvhost.c
@@ -1254,7 +1254,7 @@ static int tegra_se_send_sha_data(struct tegra_se_dev *se_dev,
        unsigned int total = count, val;
        u64 msg_len;

-       cmdbuf_cpuvaddr = dma_alloc_attrs(se_dev->dev->parent, SZ_4K,
+       cmdbuf_cpuvaddr = dma_alloc_coherent(se_dev->dev->parent, SZ_4K,
                                          &cmdbuf_iova, GFP_KERNEL,
                                          __DMA_ATTR(attrs));
        if (!cmdbuf_cpuvaddr) {
@@ -1264,7 +1264,7 @@ static int tegra_se_send_sha_data(struct tegra_se_dev *se_dev,

        while (total) {
                if (src_ll->data_len & SE_BUFF_SIZE_MASK) {
-                       dma_free_attrs(se_dev->dev->parent, SZ_4K,
+                       dma_free_coherent(se_dev->dev->parent, SZ_4K,
                                       cmdbuf_cpuvaddr, cmdbuf_iova,
                                       __DMA_ATTR(attrs));
                        return -EINVAL;
@@ -1347,7 +1347,7 @@ static int tegra_se_send_sha_data(struct tegra_se_dev *se_dev,
        err = tegra_se_channel_submit_gather(se_dev, cmdbuf_cpuvaddr,
                                             cmdbuf_iova, 0, cmdbuf_num_words,
                                             SHA_CB);
-       dma_free_attrs(se_dev->dev->parent, SZ_4K, cmdbuf_cpuvaddr,
+       dma_free_coherent(se_dev->dev->parent, SZ_4K, cmdbuf_cpuvaddr,
                       cmdbuf_iova, __DMA_ATTR(attrs));

If the above doesn’t work,
please lower the priority (cra_priority) of the SHA-1 entry, which bypasses the hardware implementation of SHA-1.
For example,

static struct ahash_alg hash_algs[] = {
        {
                        ...
                        .cra_name = "sha1",
                        .cra_driver_name = "tegra-se-sha1",
                        .cra_priority = 300,
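
As a rough illustration of why lowering cra_priority bypasses the hardware driver, here is a minimal userspace sketch of a by-name lookup that prefers the highest-priority implementation. It is a simplified model, not the kernel's lookup code: struct alg_entry and pick_alg are hypothetical names, and any priority values other than the 300 shown above are illustrative assumptions, not values read from this kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the struct crypto_alg fields that matter here. */
struct alg_entry {
    const char *cra_name;        /* generic algorithm name, e.g. "sha1" */
    const char *cra_driver_name; /* name of one specific implementation */
    int cra_priority;            /* higher value is preferred */
};

/*
 * Return the entry a by-name lookup would prefer: the highest
 * cra_priority among the entries whose cra_name matches. Returns NULL
 * when nothing matches. On a tie, the earlier entry is kept.
 */
const struct alg_entry *pick_alg(const struct alg_entry *algs, size_t n,
                                 const char *name)
{
    const struct alg_entry *best = NULL;

    assert(algs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(algs[i].cra_name, name) != 0)
            continue;
        if (best == NULL || algs[i].cra_priority > best->cra_priority)
            best = &algs[i];
    }
    return best;
}
```

With a table holding a software "sha1" entry at an assumed priority of 100 and "tegra-se-sha1" at 300, pick_alg returns the Tegra entry; dropping the Tegra entry below the software one makes the same call return the software implementation, so a caller asking for "sha1" never reaches the hardware path.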

Thanks Jerry,

I tried the first method, but unfortunately all patching facilities in the kernel were disabled by default, which we had not realized.

The second worked!!!
We used addresses from System.map (there is no kallsyms info by default), once we realized that there were two structure arrays to patch… I am not sure whether one is some sort of backup, but both had the same driver name and so on.

At least that will keep us going until we can prepare some way of flashing the Jetsons again.

Cheers
David
