Segfault running 'jetson_clocks --fan' on a kernel built with CONFIG_DEBUG_MUTEXES

On a RedHawk debug kernel, running the jetson_clocks --fan command produces the backtrace below and ends with a segmentation fault. If I then run jetson_clocks --show after this fault, the system hangs and requires a power cycle.

root@cyclops:~# jetson_clocks --fan
[   38.010631] Unable to handle kernel NULL pointer dereference at virtual address 00000018
[   38.018623] Mem abort info:
[   38.021750]   ESR = 0x96000005
[   38.024621]   Exception class = DABT (current EL), IL = 32 bits
[   38.030657]   SET = 0, FnV = 0
[   38.033889]   EA = 0, S1PTW = 0
[   38.037040] Data abort info:
[   38.039837]   ISV = 0, ISS = 0x00000005
[   38.043781]   CM = 0, WnR = 0
[   38.046840] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc3bcdf8000
[   38.053338] [0000000000000018] *pgd=0000000000000000, *pud=0000000000000000
[   38.060575] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   38.066098] Modules linked in:
[   38.069072] CPU: 4 PID: 7419 Comm: jetson_clocks Tainted: G        W       4.9.201-rt134-r32.5.1-tegra-RedHawk-7.5.5-r629-nvidia-...-oops #18
[   38.082011] Hardware name: Jetson-AGX (DT)
[   38.085954] task: ffffffc3bdd64f00 task.stack: ffffffc3d09e0000
[   38.091821] PC is at debug_mutex_wake_waiter+0x50/0x150
[   38.096983] LR is at __mutex_unlock_slowpath+0xdc/0x1d8
[   38.102140] pc : [<ffffff8008142930>] lr : [<ffffff8009114adc>] pstate: a04001c5
[   38.109752] sp : ffffffc3d09e3be0
[   38.113161] x29: ffffffc3d09e3be0 x28: ffffffc3bdd64f00 
[   38.118679] x27: ffffff8009132000 x26: 0000000000000040 
[   38.124278] x25: 00000000000001af x24: 0000000000000015 
[   38.130134] x23: 0000000000000140 x22: ffffff800b849000 
[   38.135646] x21: ffffff800b849000 x20: 0000000000000000 
[   38.141167] x19: ffffffc3eb4f5db8 x18: ffffffffffffffff 
[   38.146941] x17: 0000007fb7ac0058 x16: ffffff80082c5280 
[   38.152797] x15: ffffff800a199fd0 x14: 3066343664646233 
[   38.158484] x13: 6366666666666620 x12: 3a746e6572727563 
[   38.164259] x11: 202d203333313a6d x10: 00000000000004b9 
[   38.169782] x9 : 745f657461647075 x8 : ffffff8008444cb8 
[   38.175810] x7 : 0000000000000001 x6 : ffffffc3ffc8b5a8 
[   38.181322] x5 : ffffffc3ffc8b5a8 x4 : 00000043f619d000 
[   38.186404] x3 : ffffffc3eb4f5dc0 x2 : 0000000000000000 
[   38.191996] x1 : 0000000000000000 x0 : ffffffc3eb4f5df8 
[   38.197351] 
[   38.198737] Process jetson_clocks (pid: 7419, stack limit = 0xffffffc3d09e0000)
[   38.205648] Call trace:
[   38.208020] [<ffffff8008142930>] debug_mutex_wake_waiter+0x50/0x150
[   38.213620] [<ffffff8009114adc>] __mutex_unlock_slowpath+0xdc/0x1d8
[   38.219479] [<ffffff8009114bf8>] mutex_unlock+0x20/0x30
[   38.224552] [<ffffff8008d1fe08>] fan_update_target_pwm+0xd0/0x2a0
[   38.230413] [<ffffff8008d20170>] fan_target_pwm_store+0x80/0xb0
[   38.235758] [<ffffff80089156fc>] dev_attr_store+0x44/0x60
[   38.241091] [<ffffff8008353c54>] sysfs_kf_write+0x54/0x78
[   38.246163] [<ffffff8008352c48>] kernfs_fop_write+0xc8/0x1e0
[   38.251244] [<ffffff80082c33c0>] __vfs_write+0x48/0x118
[   38.256055] [<ffffff80082c4144>] vfs_write+0xac/0x1b0
[   38.261038] [<ffffff80082c52ec>] SyS_write+0x6c/0xf8
[   38.265768] [<ffffff80080833dc>] __sys_trace_return+0x0/0x4
[   38.271628] ---[ end trace 27f8a1a616fb2f93 ]---
Segmentation fault
root@cyclops:~#

To get the above backtrace, the kernel needs to be built with CONFIG_DEBUG_MUTEXES. Hence, it can’t be seen on the kernel shipped with JetPack, as this option is not enabled there.

I also noticed that the issue is not present on a PREEMPT_RT kernel with this option enabled (the prt-debug flavor of RedHawk), as mutex_is_locked() never reports the mutex as locked there. This function is implemented differently on PREEMPT_RT and non-PREEMPT_RT kernels.

The offending snippet of the code is in nvidia/drivers/thermal/pwm_fan.c at:

static void fan_update_target_pwm(struct fan_dev_data *fan_data, int val)
{
	fan_data->next_target_pwm = min(val, fan_data->fan_cap_pwm);

	/* If a new pwm update request, reset the lock sequence */
	if (mutex_is_locked(&fan_data->pwm_set))
		mutex_unlock(&fan_data->pwm_set);
	...

The mutex_is_locked() function for a non-PREEMPT_RT kernel is defined as:

/**
 * mutex_is_locked - is the mutex locked
 * @lock: the mutex to be queried
 *
 * Returns 1 if the mutex is locked, 0 if unlocked.
 */
static inline int mutex_is_locked(struct mutex *lock)
{
	return atomic_read(&lock->count) != 1;
}

I also looked at the ownership of the lock, and it appears that current (the calling task) is different from the task that owns the lock. Please find the debug printk output below:

root@cyclops:~# dmesg |grep curr
[   40.870718] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3b7dc27c0, owner(fan_data->pwm_set):           (null)
[  198.395733] pwm_fan_driver pwm-fan: fan_update_target_pwm:133 - current: ffffffc3c07327c0, owner(fan_data->pwm_set): ffffffc3b7dc27c0
root@cyclops:~#

The first line was printed when I ran jetson_clocks --fan after a fresh reboot, and the second one after re-running the command. As can be seen above, current and owner are different, and the owner in the second line is the current task from the first run of the command.

I am hoping there’s a patch that I can apply here to avoid the above segfault.

[EDIT] new patches to fix kernel building failure.

hello AnishAney,

it’s a possible MUTEX LOCK issue in the pwm_fan driver,
please apply the patches for verification, Topic177104_May07_pwm-fan.zip (3.0 KB)
they are based on the JetPack-4.5 / l4t-r32.5 release version.
thanks

Hi,

Thank you for sharing the patches. I have applied all three patches. However, I got a compilation error after applying 0003-pwm-fan-fix-deadlock-due-to-incorrect-locking.patch.

The error was that the variable time_off may not be initialized at line 666 of the file nvidia/drivers/thermal/pwm_fan.c. So I initialized it with time_off = fan_data->rpm_invalid_retry_delay; as is done in other places. Please let me know if this is the correct thing to do.
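For anyone hitting the same build failure, here is a small stand-alone sketch of the pattern. It is not the code from pwm_fan.c: struct fan_cfg, pick_time_off and its parameters are hypothetical names, and only the field name rpm_invalid_retry_delay is borrowed from the driver. The point is that giving the variable a default at its declaration means every path returns a defined value, which is what silences a "may be used uninitialized" style diagnostic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-fan data. */
struct fan_cfg {
	int rpm_invalid_retry_delay;
};

/* Hypothetical helper: choose a delay, falling back to the retry delay
 * when no valid measurement is available. */
int pick_time_off(const struct fan_cfg *cfg, int rpm_valid, int measured_delay)
{
	/* Default assigned up front, so no path can read an unset value. */
	int time_off;

	assert(cfg != NULL);
	time_off = cfg->rpm_invalid_retry_delay;

	if (rpm_valid)
		time_off = measured_delay;

	return time_off;
}
```

Whether the retry delay is the right default for the real driver depends on what the surrounding code expects, so this only shows the shape of the fix.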

BTW, applying the patches fixed the issue. I no longer see the segfault/backtrace.

Please confirm whether my fix for the time_off variable is correct, so that I can patch the kernel.

hello AnishAney,

thanks for the confirmation, this change is still in the code-review stage.

you may use your fix to resolve the build failure, i.e. time_off = fan_data->rpm_invalid_retry_delay;
I’ll also update the patch in my previous comment, thanks


Yes, but since it may affect the system behavior while it’s running, I will go ahead and apply the patch for now.

hello AnishAney,

FYI,
we have checked in the changes to fix the deadlock due to incorrect locking.
please expect this to be included in the next public release, i.e. JetPack-4.6
thanks
