Hello,
I work for Concurrent Real-Time, Inc., which provides a hard real-time OS by applying patches on top of the NVIDIA L4T kernel for the TX1, TX2 and TX2i modules.
When we test our “debug” kernel (a kernel with most of the debug features turned on), it displays the lockdep message below.
[ 62.191228] ======================================================
[ 62.191229] [ INFO: possible circular locking dependency detected ]
[ 62.191232] 4.4.38-rt49-r28.2.1-RedHawk-7.3.3-prt-debug #1 Tainted: G W
[ 62.191234] -------------------------------------------------------
[ 62.191236] NetworkManager/1148 is trying to acquire lock:
[ 62.191491] (net_if_lock){+.+.+.}, at: [<ffffffbffc1a7bc0>] dhd_open+0x40/0x2b0 [bcmdhd]
[ 62.191492]
[ 62.191492] but task is already holding lock:
[ 62.191503] (rtnl_mutex){+.+.+.}, at: [<ffffffc000a6a544>] rtnl_lock+0x1c/0x28
[ 62.191504]
[ 62.191504] which lock already depends on the new lock.
[ 62.191504]
[ 62.191505]
[ 62.191505] the existing dependency chain (in reverse order) is:
[ 62.191509]
[ 62.191509] -> #1 (rtnl_mutex){+.+.+.}:
[ 62.191518] [<ffffffc000102cc0>] check_prev_add+0x398/0x890
[ 62.191522] [<ffffffc0001032e0>] check_prevs_add+0x128/0x150
[ 62.191526] [<ffffffc000103738>] validate_chain.isra.11+0x430/0x558
[ 62.191531] [<ffffffc000104cd4>] __lock_acquire+0x3b4/0xa48
[ 62.191535] [<ffffffc000105c74>] lock_acquire+0xec/0x268
[ 62.191540] [<ffffffc000c27c7c>] _raw_spin_lock_irqsave+0x54/0x70
[ 62.191546] [<ffffffc000c25c8c>] rt_mutex_slowunlock+0x2c/0xa8
[ 62.191550] [<ffffffc000c26708>] rt_mutex_unlock+0x48/0x88
[ 62.191554] [<ffffffc000c28cc4>] _mutex_unlock+0x34/0x40
[ 62.191559] [<ffffffc0002c8530>] kernfs_get_open_node.isra.2+0x100/0x170
[ 62.191562] [<ffffffc0002c8790>] kernfs_fop_open+0x1f0/0x2f8
[ 62.191568] [<ffffffc00024e984>] do_dentry_open+0x21c/0x320
[ 62.191572] [<ffffffc00024ffa4>] vfs_open+0x5c/0x88
[ 62.191576] [<ffffffc000260988>] do_last+0x118/0x710
[ 62.191579] [<ffffffc0002611c8>] path_openat+0x88/0x170
[ 62.191583] [<ffffffc000262428>] do_filp_open+0x48/0xc0
[ 62.191587] [<ffffffc0002503a8>] do_sys_open+0x130/0x218
[ 62.191591] [<ffffffc000250514>] SyS_openat+0x3c/0x50
[ 62.191596] [<ffffffc000084d4c>] __sys_trace_return+0x0/0x4
[ 62.191600]
[ 62.191600] -> #0 (net_if_lock){+.+.+.}:
[ 62.191604] [<ffffffc000101c70>] print_circular_bug+0x78/0x108
[ 62.191608] [<ffffffc000102c94>] check_prev_add+0x36c/0x890
[ 62.191612] [<ffffffc0001032e0>] check_prevs_add+0x128/0x150
[ 62.191616] [<ffffffc000103738>] validate_chain.isra.11+0x430/0x558
[ 62.191619] [<ffffffc000104cd4>] __lock_acquire+0x3b4/0xa48
[ 62.191623] [<ffffffc000105c74>] lock_acquire+0xec/0x268
[ 62.191626] [<ffffffc000c289cc>] _mutex_lock+0x3c/0x50
[ 62.191872] [<ffffffbffc1a7bc0>] dhd_open+0x40/0x2b0 [bcmdhd]
[ 62.191880] [<ffffffc000a5bd90>] __dev_open+0xb8/0x128
[ 62.191885] [<ffffffc000a5c084>] __dev_change_flags+0x7c/0x150
[ 62.191889] [<ffffffc000a5c18c>] dev_change_flags+0x34/0x70
[ 62.191892] [<ffffffc000a6cf64>] do_setlink+0x294/0x6f8
[ 62.191896] [<ffffffc000a6f038>] rtnl_newlink+0x3a8/0x688
[ 62.191899] [<ffffffc000a6f3a0>] rtnetlink_rcv_msg+0x88/0x148
[ 62.191905] [<ffffffc000a946f4>] netlink_rcv_skb+0xcc/0xf8
[ 62.191909] [<ffffffc000a6cb2c>] rtnetlink_rcv+0x2c/0x40
[ 62.191913] [<ffffffc000a921f4>] netlink_unicast_kernel+0x5c/0xb0
[ 62.191917] [<ffffffc000a94004>] netlink_unicast+0xd4/0x148
[ 62.191921] [<ffffffc000a94450>] netlink_sendmsg+0x2b0/0x308
[ 62.191926] [<ffffffc000a35470>] sock_sendmsg+0x60/0x70
[ 62.191929] [<ffffffc000a36ddc>] ___sys_sendmsg+0x224/0x238
[ 62.191933] [<ffffffc000a37f68>] __sys_sendmsg+0x50/0x90
[ 62.191937] [<ffffffc000a37fdc>] SyS_sendmsg+0x34/0x48
[ 62.191941] [<ffffffc000084d4c>] __sys_trace_return+0x0/0x4
[ 62.191942]
[ 62.191942] other info that might help us debug this:
[ 62.191942]
[ 62.191944] Possible unsafe locking scenario:
[ 62.191944]
[ 62.191945]        CPU0                    CPU1
[ 62.191945]        ----                    ----
[ 62.191948]   lock(rtnl_mutex);
[ 62.191950]                               lock(net_if_lock);
[ 62.191953]                               lock(rtnl_mutex);
[ 62.191955]   lock(net_if_lock);
[ 62.191956]
[ 62.191956] *** DEADLOCK ***
[ 62.191956]
[ 62.191959] 1 lock held by NetworkManager/1148:
[ 62.191966] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffc000a6a544>] rtnl_lock+0x1c/0x28
[ 62.191967]
[ 62.191967] stack backtrace:
[ 62.191971] CPU: 3 PID: 1148 Comm: NetworkManager Tainted: G W 4.4.38-rt49-r28.2.1-RedHawk-7.3.3-prt-debug #1
[ 62.191973] Hardware name: quill (DT)
[ 62.191975] Call trace:
[ 62.191979] [<ffffffc00008a920>] dump_backtrace+0x0/0xe0
[ 62.191983] [<ffffffc00008ad04>] show_stack+0x24/0x30
[ 62.191988] [<ffffffc0003687e8>] __dump_stack+0x20/0x28
[ 62.191991] [<ffffffc000368884>] dump_stack+0x94/0xc8
[ 62.191995] [<ffffffc000101cf8>] print_circular_bug+0x100/0x108
[ 62.191998] [<ffffffc000102c94>] check_prev_add+0x36c/0x890
[ 62.192001] [<ffffffc0001032e0>] check_prevs_add+0x128/0x150
[ 62.192004] [<ffffffc000103738>] validate_chain.isra.11+0x430/0x558
[ 62.192007] [<ffffffc000104cd4>] __lock_acquire+0x3b4/0xa48
[ 62.192010] [<ffffffc000105c74>] lock_acquire+0xec/0x268
[ 62.192013] [<ffffffc000c289cc>] _mutex_lock+0x3c/0x50
[ 62.192259] [<ffffffbffc1a7bc0>] dhd_open+0x40/0x2b0 [bcmdhd]
[ 62.192264] [<ffffffc000a5bd90>] __dev_open+0xb8/0x128
[ 62.192267] [<ffffffc000a5c084>] __dev_change_flags+0x7c/0x150
[ 62.192271] [<ffffffc000a5c18c>] dev_change_flags+0x34/0x70
[ 62.192273] [<ffffffc000a6cf64>] do_setlink+0x294/0x6f8
[ 62.192276] [<ffffffc000a6f038>] rtnl_newlink+0x3a8/0x688
[ 62.192278] [<ffffffc000a6f3a0>] rtnetlink_rcv_msg+0x88/0x148
[ 62.192282] [<ffffffc000a946f4>] netlink_rcv_skb+0xcc/0xf8
[ 62.192285] [<ffffffc000a6cb2c>] rtnetlink_rcv+0x2c/0x40
[ 62.192289] [<ffffffc000a921f4>] netlink_unicast_kernel+0x5c/0xb0
[ 62.192292] [<ffffffc000a94004>] netlink_unicast+0xd4/0x148
[ 62.192295] [<ffffffc000a94450>] netlink_sendmsg+0x2b0/0x308
[ 62.192298] [<ffffffc000a35470>] sock_sendmsg+0x60/0x70
[ 62.192301] [<ffffffc000a36ddc>] ___sys_sendmsg+0x224/0x238
[ 62.192304] [<ffffffc000a37f68>] __sys_sendmsg+0x50/0x90
[ 62.192306] [<ffffffc000a37fdc>] SyS_sendmsg+0x34/0x48
[ 62.192309] [<ffffffc000084d4c>] __sys_trace_return+0x0/0x4
[ 62.192339]
[ 62.192339] Dongle Host Driver, version 1.201.82 (r)
[ 62.192339] Compiled in drivers/net/wireless/bcmdhd on Sep 6 2018 at 20:39:00
As can be seen here, a possible ABBA deadlock could form. The mutex_lock(&net_if_lock) call in dhd_open() was introduced with NVIDIA commit Id I532d51065cfe74b22804565cfa4b5c7aca139e23, if I am not mistaken.
I was able to get past this lockdep message with the following patch. I am not sure this is the best way to fix the issue, since I am not thoroughly familiar with the code, so it would be great if someone could comment on my fix.
Index: e/kernel/kernel-4.4/drivers/net/wireless/bcmdhd/dhd_linux.c
===================================================================
--- e.orig/kernel/kernel-4.4/drivers/net/wireless/bcmdhd/dhd_linux.c
+++ e/kernel/kernel-4.4/drivers/net/wireless/bcmdhd/dhd_linux.c
@@ -4359,14 +4359,17 @@ dhd_open(struct net_device *net)
int32 ret = 0;
-
-
DHD_OS_WAKE_LOCK(&dhd->pub);
DHD_PERIM_LOCK(&dhd->pub);
dhd->pub.dongle_trap_occured = 0;
dhd->pub.hang_was_sent = 0;
- mutex_lock(&net_if_lock);
+ if (!mutex_trylock(&net_if_lock)) {
+ DHD_ERROR(("%s: failed to acquire net_if_lock lock.\n",
+ __FUNCTION__));
+ goto exit_before_netiflock;
+ }
+
#if !defined(WL_CFG80211)
/*
* Force start if ifconfig_up gets called before START command
@@ -4380,7 +4383,7 @@ dhd_open(struct net_device *net)
goto exit;
}
-#endif
+#endif
ifidx = dhd_net2idx(dhd, net);
DHD_TRACE(("%s: ifidx %d\n", __FUNCTION__, ifidx));
@@ -4515,9 +4518,11 @@ exit:
if (ret)
dhd_stop(net);
+ mutex_unlock(&net_if_lock);
+
+exit_before_netiflock:
DHD_PERIM_UNLOCK(&dhd->pub);
DHD_OS_WAKE_UNLOCK(&dhd->pub);
- mutex_unlock(&net_if_lock);
return ret;
Please also treat this as a (possible) bug report, since I don’t know of another channel for reporting such L4T kernel bugs.