On a RedHawk debug kernel, running jetson_clocks --fan produces the backtrace below, followed by a segmentation fault. If I then run jetson_clocks --show after this fault, the system hangs and requires a power cycle.
root@cyclops:~# jetson_clocks --fan
[ 38.010631] Unable to handle kernel NULL pointer dereference at virtual address 00000018
[ 38.018623] Mem abort info:
[ 38.021750] ESR = 0x96000005
[ 38.024621] Exception class = DABT (current EL), IL = 32 bits
[ 38.030657] SET = 0, FnV = 0
[ 38.033889] EA = 0, S1PTW = 0
[ 38.037040] Data abort info:
[ 38.039837] ISV = 0, ISS = 0x00000005
[ 38.043781] CM = 0, WnR = 0
[ 38.046840] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc3bcdf8000
[ 38.053338] [0000000000000018] *pgd=0000000000000000, *pud=0000000000000000
[ 38.060575] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 38.066098] Modules linked in:
[ 38.069072] CPU: 4 PID: 7419 Comm: jetson_clocks Tainted: G W 4.9.201-rt134-r32.5.1-tegra-RedHawk-7.5.5-r629-nvidia-...-oops #18
[ 38.082011] Hardware name: Jetson-AGX (DT)
[ 38.085954] task: ffffffc3bdd64f00 task.stack: ffffffc3d09e0000
[ 38.091821] PC is at debug_mutex_wake_waiter+0x50/0x150
[ 38.096983] LR is at __mutex_unlock_slowpath+0xdc/0x1d8
[ 38.102140] pc : [<ffffff8008142930>] lr : [<ffffff8009114adc>] pstate: a04001c5
[ 38.109752] sp : ffffffc3d09e3be0
[ 38.113161] x29: ffffffc3d09e3be0 x28: ffffffc3bdd64f00
[ 38.118679] x27: ffffff8009132000 x26: 0000000000000040
[ 38.124278] x25: 00000000000001af x24: 0000000000000015
[ 38.130134] x23: 0000000000000140 x22: ffffff800b849000
[ 38.135646] x21: ffffff800b849000 x20: 0000000000000000
[ 38.141167] x19: ffffffc3eb4f5db8 x18: ffffffffffffffff
[ 38.146941] x17: 0000007fb7ac0058 x16: ffffff80082c5280
[ 38.152797] x15: ffffff800a199fd0 x14: 3066343664646233
[ 38.158484] x13: 6366666666666620 x12: 3a746e6572727563
[ 38.164259] x11: 202d203333313a6d x10: 00000000000004b9
[ 38.169782] x9 : 745f657461647075 x8 : ffffff8008444cb8
[ 38.175810] x7 : 0000000000000001 x6 : ffffffc3ffc8b5a8
[ 38.181322] x5 : ffffffc3ffc8b5a8 x4 : 00000043f619d000
[ 38.186404] x3 : ffffffc3eb4f5dc0 x2 : 0000000000000000
[ 38.191996] x1 : 0000000000000000 x0 : ffffffc3eb4f5df8
[ 38.197351]
[ 38.198737] Process jetson_clocks (pid: 7419, stack limit = 0xffffffc3d09e0000)
[ 38.205648] Call trace:
[ 38.208020] [<ffffff8008142930>] debug_mutex_wake_waiter+0x50/0x150
[ 38.213620] [<ffffff8009114adc>] __mutex_unlock_slowpath+0xdc/0x1d8
[ 38.219479] [<ffffff8009114bf8>] mutex_unlock+0x20/0x30
[ 38.224552] [<ffffff8008d1fe08>] fan_update_target_pwm+0xd0/0x2a0
[ 38.230413] [<ffffff8008d20170>] fan_target_pwm_store+0x80/0xb0
[ 38.235758] [<ffffff80089156fc>] dev_attr_store+0x44/0x60
[ 38.241091] [<ffffff8008353c54>] sysfs_kf_write+0x54/0x78
[ 38.246163] [<ffffff8008352c48>] kernfs_fop_write+0xc8/0x1e0
[ 38.251244] [<ffffff80082c33c0>] __vfs_write+0x48/0x118
[ 38.256055] [<ffffff80082c4144>] vfs_write+0xac/0x1b0
[ 38.261038] [<ffffff80082c52ec>] SyS_write+0x6c/0xf8
[ 38.265768] [<ffffff80080833dc>] __sys_trace_return+0x0/0x4
[ 38.271628] ---[ end trace 27f8a1a616fb2f93 ]---
Segmentation fault
root@cyclops:~#
To get the above backtrace, the kernel needs to be built with CONFIG_DEBUG_MUTEXES enabled. It therefore can't be seen on the kernel shipped with JetPack, which does not enable this option.
I also noticed that the issue is not present on a PREEMPT_RT kernel with this option enabled (the prt-debug flavor of RedHawk), because mutex_is_locked() never returns true there. This function is implemented differently on PREEMPT_RT and non-PREEMPT_RT kernels.
The offending code is in fan_update_target_pwm() in nvidia/drivers/thermal/pwm_fan.c:
static void fan_update_target_pwm(struct fan_dev_data *fan_data, int val)
{
	fan_data->next_target_pwm = min(val, fan_data->fan_cap_pwm);
	/* If a new pwm update request, reset the lock sequence */
	if (mutex_is_locked(&fan_data->pwm_set))
		mutex_unlock(&fan_data->pwm_set);
	...
The mutex_is_locked() function for a non-PREEMPT_RT kernel is defined as:
/**
 * mutex_is_locked - is the mutex locked
 * @lock: the mutex to be queried
 *
 * Returns 1 if the mutex is locked, 0 if unlocked.
 */
static inline int mutex_is_locked(struct mutex *lock)
{
	return atomic_read(&lock->count) != 1;
}
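To illustrate why this check does not make the following mutex_unlock() safe, here is a minimal user-space sketch (not kernel code). It assumes the 4.9-era convention that count is 1 when unlocked, 0 when locked, and negative when locked with possible waiters. struct model_mutex, its owner field, the integer task IDs, and all model_* helpers are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the 4.9 struct mutex. */
struct model_mutex {
	atomic_int count; /* 1: unlocked, 0: locked, <0: locked, possible waiters */
	int owner;        /* ID of the task holding the lock, 0 if none */
};

static void model_mutex_init(struct model_mutex *lock)
{
	atomic_init(&lock->count, 1);
	lock->owner = 0;
}

/* Uncontended lock only: 1 -> 0, and record who took it. */
static void model_mutex_lock(struct model_mutex *lock, int task)
{
	atomic_fetch_sub(&lock->count, 1);
	lock->owner = task;
}

/* Same test as mutex_is_locked() above: it never looks at the owner. */
static int model_mutex_is_locked(struct model_mutex *lock)
{
	return atomic_load(&lock->count) != 1;
}

/* The question the driver would need answered before unlocking. */
static int model_task_owns_lock(struct model_mutex *lock, int task)
{
	return model_mutex_is_locked(lock) && lock->owner == task;
}

static void model_demo(void)
{
	struct model_mutex m;

	model_mutex_init(&m);
	model_mutex_lock(&m, 1);               /* task 1 takes the lock */
	assert(model_mutex_is_locked(&m));     /* true for every caller */
	assert(!model_task_owns_lock(&m, 2));  /* but task 2 is not the owner */
}
```

In this model the "is it locked" test passes for task 2 even though only task 1 may legally release the lock.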
I also looked at the ownership of the lock, and it appears that current (the calling task) is not the task that owns the lock. The debug printk output is below:
root@cyclops:~# dmesg |grep curr
[ 40.870718] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3b7dc27c0, owner(fan_data->pwm_set): (null)
[ 198.395733] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3c07327c0, owner(fan_data->pwm_set): ffffffc3b7dc27c0
root@cyclops:~#
The first line was printed when I ran jetson_clocks --fan after a fresh reboot, and the second when I re-ran the command. As can be seen in the second line, current and owner are different, and the owner there is the task from the first run of the command.
I am hoping there is a patch I can apply here to avoid the above segfault.
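The same pattern, one task unlocking a mutex that another task locked, can be shown in user space. The sketch below is only an analogy, assuming POSIX threads with an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), where the library reports the misuse as EPERM instead of crashing. struct demo, owner_thread(), unlock_as_non_owner(), and check_non_owner_unlock() are hypothetical names; error handling on the setup calls is omitted for brevity.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

struct demo {
	pthread_mutex_t lock;
	sem_t locked;   /* posted by the owner once it holds the lock */
	sem_t checked;  /* posted by the caller after its unlock attempt */
};

/* Plays the role of the first run: takes the lock and keeps it. */
static void *owner_thread(void *arg)
{
	struct demo *d = arg;

	pthread_mutex_lock(&d->lock);
	sem_post(&d->locked);
	sem_wait(&d->checked);
	pthread_mutex_unlock(&d->lock); /* legal: this thread is the owner */
	return NULL;
}

/* Returns what pthread_mutex_unlock() reports to a non-owner. */
static int unlock_as_non_owner(void)
{
	struct demo d;
	pthread_mutexattr_t attr;
	pthread_t owner;
	int rc;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&d.lock, &attr);
	pthread_mutexattr_destroy(&attr);
	sem_init(&d.locked, 0, 0);
	sem_init(&d.checked, 0, 0);

	pthread_create(&owner, NULL, owner_thread, &d);
	sem_wait(&d.locked);                 /* lock is now held by owner */
	rc = pthread_mutex_unlock(&d.lock);  /* plays the second run */
	sem_post(&d.checked);
	pthread_join(owner, NULL);

	sem_destroy(&d.locked);
	sem_destroy(&d.checked);
	pthread_mutex_destroy(&d.lock);
	return rc;
}

static void check_non_owner_unlock(void)
{
	assert(unlock_as_non_owner() == EPERM);
}
```

The kernel mutex API has no such error return, so with CONFIG_DEBUG_MUTEXES the equivalent misuse shows up as the oops above rather than an error code.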