R32.7.1 / 4.9.253-rt168 : INFO: possible circular locking dependency detected (nvpmodel: all_q_mutex + &hp->lock)

Hello,

I am seeing the following INFO message on boot when running a preemptible kernel with LOCKDEP and related debugging options enabled.
Could you please let me know whether this is a real issue and, if so, whether there is a fix for it?
Thank you

My configuration:

  • Jetson AGX Xavier
  • JetPack 4.6.1
  • custom-built r32.7.1 kernel with the rt patches applied (4.9.253-rt168)
  • CONFIG_PREEMPT_RT_FULL=y
  • CONFIG_DEBUG_KERNEL=y
  • CONFIG_LOCKDEP=y
  • CONFIG_PROVE_LOCKING=y
  • CONFIG_DEBUG_SPINLOCK=y
[   47.171157] [ INFO: possible circular locking dependency detected ]
[   47.171171] 4.9.253-rt168 #2 Tainted: G        W      
[   47.171178] -------------------------------------------------------
[   47.171185] nvpmodel/3863 is trying to acquire lock:
[   47.171243]  (all_q_mutex){+.+...}, at: [<ffffff80084ac56c>] blk_mq_queue_reinit_work+0x2c/0x128
[   47.171248] 
[   47.171248] but task is already holding lock:
[   47.171268]  (&hp->lock){+.+...}, at: [<ffffff80080b1720>] takedown_cpu+0x98/0x1b0
[   47.171271] 
[   47.171271] which lock already depends on the new lock.
[   47.171271] 
[   47.171286] 
[   47.171286] the existing dependency chain (in reverse order) is:
[   47.171316] 
[   47.171316] -> #3 (&hp->lock){+.+...}:
[   47.171323]        check_prev_add+0x558/0x6f0
[   47.171326]        validate_chain.isra.19+0xa6c/0xc80
[   47.171329]        __lock_acquire+0x2dc/0x758
[   47.171332]        lock_acquire+0xd4/0x2d0
[   47.171337]        rt_spin_lock__no_mg+0x5c/0x68
[   47.171340]        takedown_cpu+0x98/0x1b0
[   47.171343]        cpuhp_invoke_callback+0xe4/0xd10
[   47.171346]        cpuhp_down_callbacks+0x60/0xb0
[   47.171351]        _cpu_down+0x298/0x3b0
[   47.171354]        do_cpu_down+0x88/0x410
[   47.171357]        cpu_down+0x24/0x30
[   47.171362]        cpu_subsys_offline+0x20/0x30
[   47.171367]        device_offline+0x84/0xd8
[   47.171370]        online_store+0x4c/0xa0
[   47.171374]        dev_attr_store+0x44/0x60
[   47.171378]        sysfs_kf_write+0x54/0x78
[   47.171381]        kernfs_fop_write+0xc0/0x1d8
[   47.171389]        __vfs_write+0x48/0x118
[   47.171392]        vfs_write+0xac/0x1b0
[   47.171395]        SyS_write+0x5c/0xc8
[   47.171400]        el0_svc_naked+0x34/0x38
[   47.171405] 
[   47.171405] -> #2 (cpu_hotplug.lock){+.+.+.}:
[   47.171408]        check_prev_add+0x558/0x6f0
[   47.171412]        validate_chain.isra.19+0xa6c/0xc80
[   47.171414]        __lock_acquire+0x2dc/0x758
[   47.171417]        lock_acquire+0xd4/0x2d0
[   47.171420]        _mutex_lock+0x3c/0x50
[   47.171423]        get_online_cpus+0x78/0xa0
[   47.171426]        __cpuhp_state_add_instance+0x64/0x1b0
[   47.171430]        blk_mq_realloc_hw_ctxs+0x214/0x420
[   47.171433]        blk_mq_init_allocated_queue+0x7c/0x458
[   47.171436]        blk_mq_init_queue+0x3c/0x78
[   47.171440]        loop_add+0xf8/0x2c0
[   47.171446]        loop_init+0x104/0x14c
[   47.171449]        do_one_initcall+0x44/0x130
[   47.171454]        kernel_init_freeable+0x224/0x2c8
[   47.171457]        kernel_init+0x18/0x108
[   47.171460]        ret_from_fork+0x10/0x50
[   47.171464] 
[   47.171464] -> #1 (&q->sysfs_lock){+.+.+.}:
[   47.171468]        check_prev_add+0x558/0x6f0
[   47.171472]        validate_chain.isra.19+0xa6c/0xc80
[   47.171475]        __lock_acquire+0x2dc/0x758
[   47.171478]        lock_acquire+0xd4/0x2d0
[   47.171481]        _mutex_lock+0x3c/0x50
[   47.171484]        blk_mq_map_swqueue+0x3c/0x208
[   47.171487]        blk_mq_init_allocated_queue+0x3f0/0x458
[   47.171489]        blk_mq_init_queue+0x3c/0x78
[   47.171492]        loop_add+0xf8/0x2c0
[   47.171495]        loop_init+0x104/0x14c
[   47.171498]        do_one_initcall+0x44/0x130
[   47.171501]        kernel_init_freeable+0x224/0x2c8
[   47.171504]        kernel_init+0x18/0x108
[   47.171507]        ret_from_fork+0x10/0x50
[   47.171512] 
[   47.171512] -> #0 (all_q_mutex){+.+...}:
[   47.171521]        print_circular_bug+0x78/0x2d8
[   47.171524]        check_prev_add+0xf4/0x6f0
[   47.171527]        validate_chain.isra.19+0xa6c/0xc80
[   47.171530]        __lock_acquire+0x2dc/0x758
[   47.171534]        lock_acquire+0xd4/0x2d0
[   47.171537]        _mutex_lock+0x3c/0x50
[   47.171540]        blk_mq_queue_reinit_work+0x2c/0x128
[   47.171543]        blk_mq_queue_reinit_dead+0x24/0x30
[   47.171546]        cpuhp_invoke_callback+0xe4/0xd10
[   47.171549]        cpuhp_down_callbacks+0x60/0xb0
[   47.171552]        _cpu_down+0x298/0x3b0
[   47.171555]        do_cpu_down+0x88/0x410
[   47.171557]        cpu_down+0x24/0x30
[   47.171561]        cpu_subsys_offline+0x20/0x30
[   47.171564]        device_offline+0x84/0xd8
[   47.171568]        online_store+0x4c/0xa0
[   47.171571]        dev_attr_store+0x44/0x60
[   47.171578]        sysfs_kf_write+0x54/0x78
[   47.171580]        kernfs_fop_write+0xc0/0x1d8
[   47.171584]        __vfs_write+0x48/0x118
[   47.171587]        vfs_write+0xac/0x1b0
[   47.171590]        SyS_write+0x5c/0xc8
[   47.171593]        el0_svc_naked+0x34/0x38
[   47.171594] 
[   47.171594] other info that might help us debug this:
[   47.171594] 
[   47.171600] Chain exists of:
[   47.171600]   all_q_mutex --> cpu_hotplug.lock --> &hp->lock
[   47.171600] 
[   47.171601]  Possible unsafe locking scenario:
[   47.171601] 
[   47.171602]        CPU0                    CPU1
[   47.171603]        ----                    ----
[   47.171607]   lock(&hp->lock);
[   47.171609]                                lock(cpu_hotplug.lock);
[   47.171612]                                lock(&hp->lock);
[   47.171615]   lock(all_q_mutex);
[   47.171616] 
[   47.171616]  *** DEADLOCK ***
[   47.171616] 
[   47.171618] 9 locks held by nvpmodel/3863:
[   47.171626]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffff800829e024>] vfs_write+0x19c/0x1b0
[   47.171633]  #1:  (&of->mutex){+.+.+.}, at: [<ffffff8008347350>] kernfs_fop_write+0x90/0x1d8
[   47.171641]  #2:  (s_active#213){.+.+.+}, at: [<ffffff8008345b7c>] kernfs_get_active+0x24/0x40
[   47.171648]  #3:  (device_hotplug_lock){+.+.+.}, at: [<ffffff8008885100>] lock_device_hotplug_sysfs+0x20/0x50
[   47.171655]  #4:  (&dev->mutex){......}, at: [<ffffff800888657c>] device_offline+0x5c/0xd8
[   47.171662]  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffff80080b4874>] do_cpu_down+0x6c/0x410
[   47.171669]  #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffff80080b46e8>] cpu_hotplug_begin+0x0/0xc0
[   47.171675]  #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffff80080b4760>] cpu_hotplug_begin+0x78/0xc0
[   47.171683]  #8:  (&hp->lock){+.+...}, at: [<ffffff80080b1720>] takedown_cpu+0x98/0x1b0
[   47.171684] 
[   47.171684] stack backtrace:
[   47.171692] CPU: 0 PID: 3863 Comm: nvpmodel Tainted: G        W       4.9.253-rt168 #2
[   47.171693] Hardware name: Jetson-AGX (DT)
[   47.171695] Call trace:
[   47.171700] [<ffffff800808c2b8>] dump_backtrace+0x0/0x1a8
[   47.171704] [<ffffff800808c964>] show_stack+0x24/0x30
[   47.171707] [<ffffff80090ea3c0>] dump_stack+0xa0/0xd0
[   47.171710] [<ffffff8008129dac>] print_circular_bug+0x1cc/0x2d8
[   47.171713] [<ffffff800812ad7c>] check_prev_add+0xf4/0x6f0
[   47.171717] [<ffffff800812bde4>] validate_chain.isra.19+0xa6c/0xc80
[   47.171720] [<ffffff800812d044>] __lock_acquire+0x2dc/0x758
[   47.171723] [<ffffff800812d9bc>] lock_acquire+0xd4/0x2d0
[   47.171726] [<ffffff80090f6a14>] _mutex_lock+0x3c/0x50
[   47.171729] [<ffffff80084ac56c>] blk_mq_queue_reinit_work+0x2c/0x128
[   47.171732] [<ffffff80084ac68c>] blk_mq_queue_reinit_dead+0x24/0x30
[   47.171735] [<ffffff80080b1d2c>] cpuhp_invoke_callback+0xe4/0xd10
[   47.171738] [<ffffff80080b3408>] cpuhp_down_callbacks+0x60/0xb0
[   47.171741] [<ffffff80090eda70>] _cpu_down+0x298/0x3b0
[   47.171744] [<ffffff80080b4890>] do_cpu_down+0x88/0x410
[   47.171747] [<ffffff80080b4c3c>] cpu_down+0x24/0x30
[   47.171750] [<ffffff800888d488>] cpu_subsys_offline+0x20/0x30
[   47.171753] [<ffffff80088865a4>] device_offline+0x84/0xd8
[   47.171757] [<ffffff80088866fc>] online_store+0x4c/0xa0
[   47.171760] [<ffffff8008883254>] dev_attr_store+0x44/0x60
[   47.171763] [<ffffff8008348404>] sysfs_kf_write+0x54/0x78
[   47.171765] [<ffffff8008347380>] kernfs_fop_write+0xc0/0x1d8
[   47.171768] [<ffffff800829cf40>] __vfs_write+0x48/0x118
[   47.171771] [<ffffff800829df34>] vfs_write+0xac/0x1b0
[   47.171774] [<ffffff800829f5bc>] SyS_write+0x5c/0xc8
[   47.171777] [<ffffff8008083140>] el0_svc_naked+0x34/0x38

Sorry for the late response; our team will investigate and provide suggestions soon. Thanks.

Could you try disabling the nvpmodel service and share whether you still get this issue? One way to disable it is to rename the binary and then reboot:
/home/ubuntu# mv /usr/sbin/nvpmodel /usr/sbin/nvpmodel_bkp
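To undo the change later, the rename is simply reversed. The commands can be tried safely against a scratch file first; the mktemp stand-in below is only for illustration, and on the device the path would be /usr/sbin/nvpmodel (root required):

```shell
bin=$(mktemp)                  # stand-in for /usr/sbin/nvpmodel
mv "$bin" "${bin}_bkp"         # "disable": binary no longer at its usual path
test ! -e "$bin" && echo disabled
mv "${bin}_bkp" "$bin"         # "re-enable": restore the original name
test -e "$bin" && echo restored
rm -f "$bin"                   # clean up the scratch file
```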

After renaming the nvpmodel binary, the original message no longer shows up (note, though, that this does not resolve the potential deadlock in the kernel).

However, another one took its place:

It looks like it just alternates between these two issues, depending on which one shows up first.
Please see the attached logs and the full config.

Thank you.

20220802-JP461-bootlog.tgz (59.4 KB)
config-4.9.253-rt168.gz (37.1 KB)

The nvhost recursive locking report is a different issue and might be a false positive. Could you share whether you still get the original issue reported in this thread after disabling nvpmodel?

As stated in my previous post above:

After renaming the nvpmodel binary, the original message no longer shows up (note, though, that this does not resolve the potential deadlock in the kernel).

CPU hotplug in 4.9-rt is not stable and can result in deadlocks.
‘nvpmodel’ powers off some of the cores via hotplug at the end of boot, which is where the possible lockup is being reported.
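For context, the backtrace above goes through online_store() and kernfs_fop_write(), i.e. a write to a CPU’s sysfs "online" node; that is the hotplug path nvpmodel takes at the end of boot. The same operation can be driven by hand for testing (cpu7 is only an example core):

```shell
# Show which CPUs are currently online (e.g. "0-7" on AGX Xavier).
cat /sys/devices/system/cpu/online

# Offlining a core takes the same online_store() path seen in the
# backtrace above (root required):
#   echo 0 > /sys/devices/system/cpu/cpu7/online
# Bringing it back online afterwards:
#   echo 1 > /sys/devices/system/cpu/cpu7/online
```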

Some of the possible deadlocks have already been reported; please refer to: [v4.9-rt][report] stress-ng: possible circular locking dependency detected (Linux Real Time)

Would it be best to disable CPU hotplug in the .config and build the kernel without it?

Thank you.

Not sure if we should disable it in the default defconfig for Jetson; we will check on it.
Could you disable ‘CONFIG_HOTPLUG_CPU’? Hopefully the issue should no longer appear.
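If you want to try that, one way is the scripts/config helper that ships in the kernel source tree (a sketch; adapt to your build flow):

```shell
# In the kernel source tree: switch CONFIG_HOTPLUG_CPU off and let
# Kconfig resolve the remaining dependencies.
scripts/config --file .config --disable HOTPLUG_CPU
make olddefconfig

# Verify the result; Kconfig may silently flip it back on if some
# other enabled option still selects HOTPLUG_CPU.
grep HOTPLUG_CPU .config
```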

It turns out that PM_SLEEP_SMP selects HOTPLUG_CPU=y, so we would have to disable more than just HOTPLUG_CPU, which could cause issues of its own in our specific case.
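For reference, the select comes from kernel/power/Kconfig; in the 4.9 tree the definition reads roughly as follows (quoted from memory, so please verify against your sources):

```
config PM_SLEEP_SMP
        def_bool y
        depends on SMP
        depends on ARCH_SUSPEND_POSSIBLE || ARCH_HIBERNATION_POSSIBLE
        depends on PM_SLEEP
        select HOTPLUG_CPU
```

So HOTPLUG_CPU can only stay off if PM_SLEEP (i.e. suspend support) is disabled as well, which is probably not acceptable on a product image.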

Thank you.