Segfault running 'jetson_clocks --fan' on a kernel built with CONFIG_DEBUG_MUTEXES

AnishAney · May 5, 2021, 9:20pm

On RedHawk’s debug kernel, running the jetson_clocks --fan command produces the backtrace below, followed by a segmentation fault. If I then run jetson_clocks --show after this fault, the system hangs and requires a power cycle.

root@cyclops:~# jetson_clocks --fan
[   38.010631] Unable to handle kernel NULL pointer dereference at virtual address 00000018
[   38.018623] Mem abort info:
[   38.021750]   ESR = 0x96000005
[   38.024621]   Exception class = DABT (current EL), IL = 32 bits
[   38.030657]   SET = 0, FnV = 0
[   38.033889]   EA = 0, S1PTW = 0
[   38.037040] Data abort info:
[   38.039837]   ISV = 0, ISS = 0x00000005
[   38.043781]   CM = 0, WnR = 0
[   38.046840] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc3bcdf8000
[   38.053338] [0000000000000018] *pgd=0000000000000000, *pud=0000000000000000
[   38.060575] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   38.066098] Modules linked in:
[   38.069072] CPU: 4 PID: 7419 Comm: jetson_clocks Tainted: G        W       4.9.201-rt134-r32.5.1-tegra-RedHawk-7.5.5-r629-nvidia-...-oops #18
[   38.082011] Hardware name: Jetson-AGX (DT)
[   38.085954] task: ffffffc3bdd64f00 task.stack: ffffffc3d09e0000
[   38.091821] PC is at debug_mutex_wake_waiter+0x50/0x150
[   38.096983] LR is at __mutex_unlock_slowpath+0xdc/0x1d8
[   38.102140] pc : [<ffffff8008142930>] lr : [<ffffff8009114adc>] pstate: a04001c5
[   38.109752] sp : ffffffc3d09e3be0
[   38.113161] x29: ffffffc3d09e3be0 x28: ffffffc3bdd64f00 
[   38.118679] x27: ffffff8009132000 x26: 0000000000000040 
[   38.124278] x25: 00000000000001af x24: 0000000000000015 
[   38.130134] x23: 0000000000000140 x22: ffffff800b849000 
[   38.135646] x21: ffffff800b849000 x20: 0000000000000000 
[   38.141167] x19: ffffffc3eb4f5db8 x18: ffffffffffffffff 
[   38.146941] x17: 0000007fb7ac0058 x16: ffffff80082c5280 
[   38.152797] x15: ffffff800a199fd0 x14: 3066343664646233 
[   38.158484] x13: 6366666666666620 x12: 3a746e6572727563 
[   38.164259] x11: 202d203333313a6d x10: 00000000000004b9 
[   38.169782] x9 : 745f657461647075 x8 : ffffff8008444cb8 
[   38.175810] x7 : 0000000000000001 x6 : ffffffc3ffc8b5a8 
[   38.181322] x5 : ffffffc3ffc8b5a8 x4 : 00000043f619d000 
[   38.186404] x3 : ffffffc3eb4f5dc0 x2 : 0000000000000000 
[   38.191996] x1 : 0000000000000000 x0 : ffffffc3eb4f5df8 
[   38.197351] 
[   38.198737] Process jetson_clocks (pid: 7419, stack limit = 0xffffffc3d09e0000)
[   38.205648] Call trace:
[   38.208020] [<ffffff8008142930>] debug_mutex_wake_waiter+0x50/0x150
[   38.213620] [<ffffff8009114adc>] __mutex_unlock_slowpath+0xdc/0x1d8
[   38.219479] [<ffffff8009114bf8>] mutex_unlock+0x20/0x30
[   38.224552] [<ffffff8008d1fe08>] fan_update_target_pwm+0xd0/0x2a0
[   38.230413] [<ffffff8008d20170>] fan_target_pwm_store+0x80/0xb0
[   38.235758] [<ffffff80089156fc>] dev_attr_store+0x44/0x60
[   38.241091] [<ffffff8008353c54>] sysfs_kf_write+0x54/0x78
[   38.246163] [<ffffff8008352c48>] kernfs_fop_write+0xc8/0x1e0
[   38.251244] [<ffffff80082c33c0>] __vfs_write+0x48/0x118
[   38.256055] [<ffffff80082c4144>] vfs_write+0xac/0x1b0
[   38.261038] [<ffffff80082c52ec>] SyS_write+0x6c/0xf8
[   38.265768] [<ffffff80080833dc>] __sys_trace_return+0x0/0x4
[   38.271628] ---[ end trace 27f8a1a616fb2f93 ]---
Segmentation fault
root@cyclops:~#

To get the above backtrace, the kernel needs to be built with CONFIG_DEBUG_MUTEXES. Hence, it can’t be seen on the kernel shipped with JetPack, as this option is not enabled there.

I also noticed that the issue is not present on a PREEMPT_RT kernel with this option enabled (the prt-debug flavor of RedHawk), because mutex_is_locked() never returns true there. This function is implemented differently on PREEMPT_RT and non-PREEMPT_RT kernels.

The offending snippet of the code is in nvidia/drivers/thermal/pwm_fan.c at:

static void fan_update_target_pwm(struct fan_dev_data *fan_data, int val)
{
	fan_data->next_target_pwm = min(val, fan_data->fan_cap_pwm);

	/* If a new pwm update request, reset the lock sequence */
	if (mutex_is_locked(&fan_data->pwm_set))
		mutex_unlock(&fan_data->pwm_set);
	...

The mutex_is_locked() function for a non-PREEMPT_RT kernel is defined as:

/**
 * mutex_is_locked - is the mutex locked
 * @lock: the mutex to be queried
 *
 * Returns 1 if the mutex is locked, 0 if unlocked.
 */
static inline int mutex_is_locked(struct mutex *lock)
{
	return atomic_read(&lock->count) != 1;
}
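
As a rough illustration of why this check is not enough on its own, here is a small userspace C model of the counter convention used by the 4.9 non-RT mutex (1 = unlocked, 0 = locked, negative = locked with possible waiters). The names model_mutex, model_mutex_is_locked and model_mutex_held_by are made up for this sketch and are not kernel APIs. The only point is that "locked" says nothing about who holds the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel's struct mutex. */
struct model_mutex {
	int count;          /* 1: unlocked, 0: locked, <0: locked, waiters possible */
	const void *owner;  /* stands in for the owner shown in the debug output */
};

/* Mirrors the check quoted above: anything other than 1 counts as locked. */
static int model_mutex_is_locked(const struct model_mutex *lock)
{
	assert(lock != NULL);
	return lock->count != 1;
}

/* Stricter question a caller would need answered before unlocking:
 * is the mutex held, and is it held by this particular task? */
static int model_mutex_held_by(const struct model_mutex *lock, const void *task)
{
	return model_mutex_is_locked(lock) && lock->owner == task;
}
```

In this model, a second task sees model_mutex_is_locked() return 1 even though model_mutex_held_by() returns 0 for it, which is the situation in which an unlock would come from a non-owner.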

I also looked at the ownership of the lock, and it appears that the current task is different from the task that owns the lock. Please find the debug printk output below:

root@cyclops:~# dmesg |grep curr
[   40.870718] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3b7dc27c0, owner(fan_data->pwm_set):           (null)
[  198.395733] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3c07327c0, owner(fan_data->pwm_set): ffffffc3b7dc27c0
root@cyclops:~#

The first line was logged when I ran jetson_clocks --fan after a fresh reboot, and the second when I re-ran the command. As can be seen above, current and owner are different, and the owner in the second line is the current task from the first run of the command.

I am hoping there is a patch that I can apply here to avoid the above segfault.
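
The same pattern can be shown in userspace as an analogy. The sketch below uses a POSIX error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), which rejects an unlock from a thread that does not own the lock. The names pwm_set_model, second_run and unlock_from_non_owner are hypothetical, and this is neither the driver code nor the patch discussed in this thread. It only illustrates one thread taking a lock and a different thread trying to release it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for fan_data->pwm_set. */
static pthread_mutex_t pwm_set_model;

/* Plays the role of the second run: unlock a mutex this thread never locked. */
static void *second_run(void *arg)
{
	int *unlock_rc = arg;

	*unlock_rc = pthread_mutex_unlock(&pwm_set_model);
	return NULL;
}

/* Returns the error code the non-owner gets back from the unlock attempt. */
static int unlock_from_non_owner(void)
{
	pthread_mutexattr_t attr;
	pthread_t thread;
	int rc, unlock_rc = -1;

	rc = pthread_mutexattr_init(&attr);
	assert(rc == 0);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(rc == 0);
	rc = pthread_mutex_init(&pwm_set_model, &attr);
	assert(rc == 0);
	pthread_mutexattr_destroy(&attr);

	/* Plays the role of the first run: take the lock and keep holding it. */
	rc = pthread_mutex_lock(&pwm_set_model);
	assert(rc == 0);

	rc = pthread_create(&thread, NULL, second_run, &unlock_rc);
	assert(rc == 0);
	pthread_join(thread, NULL);

	/* The real owner releases the lock before cleanup. */
	pthread_mutex_unlock(&pwm_set_model);
	pthread_mutex_destroy(&pwm_set_model);
	return unlock_rc;
}
```

With an error-checking mutex, the non-owner's unlock is refused with EPERM rather than corrupting the lock state. The kernel mutex has no such return value, so the mismatch only becomes visible through the CONFIG_DEBUG_MUTEXES checks.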

JerryChang · May 6, 2021, 8:08am

[EDIT] new patches to fix kernel building failure.

hello AnishAney,

this is possibly a mutex locking issue in the pwm_fan driver.
please apply the patches for verification: Topic177104_May07_pwm-fan.zip (3.0 KB)
they are based on the JetPack-4.5 / l4t-r32.5 release version.
thanks

AnishAney · May 6, 2021, 12:29pm

Hi,

Thank you for sharing the patches. I have applied all three patches. However, I got a compilation error after applying the 0003-pwm-fan-fix-deadlock-due-to-incorrect-locking.patch patch.

The error was that the variable time_off may not be initialized at line 666 of the file nvidia/drivers/thermal/pwm_fan.c. So, I initialized the variable with time_off = fan_data->rpm_invalid_retry_delay; as is done in other places. Please let me know if this is the correct thing to do.
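
For readers unfamiliar with this class of compiler error, here is a minimal sketch of the pattern. The struct fan_model, its rpm_valid and measured_delay fields, and the pick_time_off function are hypothetical and do not come from pwm_fan.c; only the rpm_invalid_retry_delay name is taken from this thread. The idea is that a variable assigned on only some paths gets a defined default up front.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of fan_dev_data. */
struct fan_model {
	int rpm_valid;                /* assumption: whether a measurement exists */
	int rpm_invalid_retry_delay;  /* name taken from the thread */
	int measured_delay;           /* assumption: value used when rpm_valid is set */
};

static int pick_time_off(const struct fan_model *fan)
{
	int time_off;

	assert(fan != NULL);

	/* Start from the retry delay so every path yields a defined value. */
	time_off = fan->rpm_invalid_retry_delay;

	/* Only some paths overwrite the default. Without the line above,
	 * the other paths would return an uninitialized value. */
	if (fan->rpm_valid)
		time_off = fan->measured_delay;

	return time_off;
}
```

Whether the retry delay is the right default for the real driver depends on how time_off is used at that point in the patch, which this sketch does not model.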

AnishAney · May 6, 2021, 1:09pm

BTW, applying the patches fixed the issue. I no longer see the segfault/backtrace.

Please confirm whether my fix to the time_off variable is correct, so that I can patch the kernel.

JerryChang · May 7, 2021, 1:48am

hello AnishAney,

thanks for the confirmation; this change is still in code review.

you may use that fix to resolve the build failure, i.e. time_off = fan_data->rpm_invalid_retry_delay;
I’ll also update the patch in my previous comment, thanks

AnishAney · May 7, 2021, 8:29pm

Yes, but since it may affect the system behavior while it’s running, I will go ahead and apply the patch for now.

JerryChang · May 26, 2021, 5:18am

hello AnishAney,

FYI,
we have checked in the changes to fix the deadlock due to incorrect locking.
please expect this to be included in the next public release, i.e. JetPack-4.6
thanks
