Linux driver BUG with the real-time patch set

I’m running into an issue while attempting to use the 418.67 display driver with a custom kernel running the real-time (PREEMPT_RT) patch set. I’m intermittently hitting a “BUG: scheduling while atomic” from one of the NVIDIA threaded interrupt handlers.

I’m aware that this is not an officially supported configuration. Nonetheless, it’s important to me, and I’d appreciate help resolving the issue, as I’m likely not the only person running the real-time patch set with CUDA.

The kernel is essentially stock, taken from the RT stable releases here: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git.

The release used is rather old: v4.19.90-rt35, built with CONFIG_PREEMPT_RT_FULL=y.

The driver is fairly old as well. It’s the release that was packaged with an earlier CUDA release, 10.1.168.

$ md5sum ./NVIDIA-Linux-x86_64-418.67.run
662865d9a7b5ef1ac3402e098a5fb91f ./NVIDIA-Linux-x86_64-418.67.run

I get the following backtrace while running a CUDA app that launches kernels and uses the Video Codec SDK.

[ 5493.745489] BUG: scheduling while atomic: irq/216-s-nvidi/9203/0x00000002
[ 5493.745490] Modules linked in: nvidia_uvm(O) sch_fq ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat bpfilter intel_rapl x86_pkg_temp_thermal intel_powerclamp crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel nvidia_drm(PO) aes_x86_64 crypto_simd nvidia_modeset(PO) cryptd glue_helper nvidia(PO) ast ttm lpc_ich mfd_core mei_me mei ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler ftdi_sio usbserial cdc_acm
[ 5493.745507] Preemption disabled at:
[ 5493.745508] [<0000000000000000>] (null)
[ 5493.745511] CPU: 37 PID: 9203 Comm: irq/216-s-nvidi Tainted: P W O 4.19.90-rt35-app #1
[ 5493.745512] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[ 5493.745512] Call Trace:
[ 5493.745524] dump_stack+0x68/0x9b
[ 5493.745528] __schedule_bug+0x71/0xc0
[ 5493.745530] __schedule+0x56d/0x680
[ 5493.745531] schedule+0x4c/0xf0
[ 5493.745534] rt_spin_lock_slowlock_locked+0x108/0x2d0
[ 5493.745536] rt_spin_lock_slowlock+0x50/0x80
[ 5493.745541] __wake_up_common_lock+0x61/0xb0
[ 5493.745758] _nv029653rm+0x147/0x180 [nvidia]
[ 5493.745966] ? _nv015202rm+0x62/0x140 [nvidia]
[ 5493.746121] ? _nv018379rm+0x70/0x190 [nvidia]
[ 5493.746312] ? _nv024595rm+0x1cf/0x5e0 [nvidia]
[ 5493.746424] ? _nv000938rm+0xe8/0x150 [nvidia]
[ 5493.746429] ? irq_forced_thread_fn+0x70/0x70
[ 5493.746540] ? rm_isr_bh+0x1c/0x60 [nvidia]
[ 5493.746606] ? nvidia_isr_common_bh+0x62/0x70 [nvidia]
[ 5493.746608] ? irq_thread_fn+0x1b/0x60
[ 5493.746609] ? irq_thread+0x12e/0x1b0
[ 5493.746610] ? preempt_count_sub+0x94/0xe0
[ 5493.746611] ? wake_threads_waitq+0x30/0x30
[ 5493.746614] ? kthread+0xf5/0x130
[ 5493.746615] ? irq_thread_check_affinity+0x80/0x80
[ 5493.746616] ? kthread_bind+0x30/0x30
[ 5493.746619] ? ret_from_fork+0x24/0x30
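The trailing 0x00000002 on the BUG line is the value of preempt_count at the time of the report. As a rough aid for reading it, here is a small userspace decoder. The field masks are what I believe include/linux/preempt.h defines for 4.19-era kernels; that the RT patch set leaves this layout unchanged is an assumption, and the struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed preempt_count layout (include/linux/preempt.h, 4.19-era):
 * bits 0-7 preemption depth, 8-15 softirq, 16-19 hardirq, bit 20 NMI. */
#define PREEMPT_MASK 0x000000ffu
#define SOFTIRQ_MASK 0x0000ff00u
#define HARDIRQ_MASK 0x000f0000u
#define NMI_MASK     0x00100000u

/* Hypothetical helper type, not a kernel structure. */
struct preempt_fields {
    unsigned preempt_depth;  /* nesting of preempt_disable() / raw locks */
    unsigned softirq_count;
    unsigned hardirq_count;
    unsigned nmi;
};

/* Split a raw preempt_count value into its component fields. */
static struct preempt_fields decode_preempt_count(uint32_t pc)
{
    struct preempt_fields f;
    f.preempt_depth = pc & PREEMPT_MASK;
    f.softirq_count = (pc & SOFTIRQ_MASK) >> 8;
    f.hardirq_count = (pc & HARDIRQ_MASK) >> 16;
    f.nmi           = (pc & NMI_MASK) >> 20;
    return f;
}
```

For 0x00000002 this gives a preemption depth of 2 with no softirq, hardirq, or NMI bits set. As far as I can tell, schedule() itself disables preemption once before the check runs, so a depth of 2 would mean exactly one extra level of disabled preemption (for example a held raw spinlock or a preempt_disable()) coming from the caller, rather than hard interrupt context.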

The last call site listed inside the NVIDIA driver before control enters kernel code is _nv029653rm+0x147.

This symbol is in the binary component of the nvidia driver.

$ objdump -d ./kernel/nvidia/nv-kernel.o_binary | grep " <_nv029653rm>:"
0000000000704750 <_nv029653rm>:

A closer look at the last call site reveals a little more:

$ objdump -d ./kernel/nvidia/nv-kernel.o_binary | grep -C 1 704897
70488f: 4c 89 f7 mov %r14,%rdi
704892: e8 00 00 00 00 callq 704897 <_nv029653rm+0x147> # 4 byte reloc here.
704897: 31 c0 xor %eax,%eax
704899: e9 d5 fe ff ff jmpq 704773 <_nv029653rm+0x23>

The actual final callsite is likely nv_post_event:

$ objdump -r ./kernel/nvidia/nv-kernel.o_binary | grep 704893
0000000000704893 R_X86_64_PC32 nv_post_event-0x0000000000000004 # call to nv_post_event
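For anyone following the addresses, the arithmetic connecting the backtrace offset, the disassembly, and the relocation entry can be written out as below. This is only a sketch of the address math for a 5-byte x86-64 call instruction (opcode e8 followed by a 4-byte displacement); the function names are invented for illustration.

```c
#include <stdint.h>

/* A near call is 1 opcode byte (0xe8) followed by a 4-byte displacement. */
#define CALL_OPCODE_LEN 1u
#define CALL_TOTAL_LEN  5u

/* Offset of the displacement field, which is where the relocation applies. */
static uint64_t call_reloc_offset(uint64_t call_addr)
{
    return call_addr + CALL_OPCODE_LEN;
}

/* Address of the next instruction, i.e. what the backtrace reports. */
static uint64_t call_return_addr(uint64_t call_addr)
{
    return call_addr + CALL_TOTAL_LEN;
}

/* Express an address as an offset from the start of its symbol. */
static uint64_t symbol_offset(uint64_t addr, uint64_t symbol_start)
{
    return addr - symbol_start;
}
```

With the call at 0x704892 and _nv029653rm starting at 0x704750, the relocation lands at 0x704893 and the return address is 0x704897, which is _nv029653rm+0x147 as shown in the backtrace. The -4 addend in the R_X86_64_PC32 entry accounts for the displacement being relative to the end of the instruction, 4 bytes past the start of the relocated field.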

nv_post_event from the open source component of the driver is as follows:

void NV_API_CALL nv_post_event(
    nv_state_t *nv,
    nv_event_t *event,
    NvHandle    handle,
    NvU32       index,
    NvBool      data_valid
)
{
    nv_file_private_t *nvfp = event->file;
    unsigned long eflags;
    nvidia_event_t *nvet;

    NV_SPIN_LOCK_IRQSAVE(&nvfp->fp_lock, eflags);

    if (data_valid)
    {
        NV_KMALLOC_ATOMIC(nvet, sizeof(nvidia_event_t));
        if (nvet == NULL)
        {
            NV_SPIN_UNLOCK_IRQRESTORE(&nvfp->fp_lock, eflags);
            return;
        }

        if (nvfp->event_tail != NULL)
            nvfp->event_tail->next = nvet;
        if (nvfp->event_head == NULL)
            nvfp->event_head = nvet;
        nvfp->event_tail = nvet;
        nvet->next = NULL;

        nvet->event = *event;
        nvet->event.hObject = handle;
        nvet->event.index = index;
    }

    nvfp->event_pending = TRUE;

    NV_SPIN_UNLOCK_IRQRESTORE(&nvfp->fp_lock, eflags);

    wake_up_interruptible(&nvfp->waitqueue);
}

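On PREEMPT_RT_FULL, NV_SPIN_LOCK_IRQSAVE cannot sleep (it is a raw_spin_lock), and fp_lock has already been released by the time the wake-up happens, so the call to wake_up_interruptible is very likely to blame. It also suggests that the atomic context is established further up the call chain, in the binary component, rather than in nv_post_event itself.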

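wake_up_interruptible almost immediately takes the wait queue’s spinlock (spin_lock_irqsave in __wake_up_common_lock). On PREEMPT_RT_FULL kernels, ordinary spinlocks are replaced with sleeping, priority-inheriting locks built on rt_mutex, so acquiring one under contention can call schedule(). The innermost frames of the stack trace show blocking on just that: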

[ 5493.745534] rt_spin_lock_slowlock_locked+0x108/0x2d0
[ 5493.745536] rt_spin_lock_slowlock+0x50/0x80
[ 5493.745541] __wake_up_common_lock+0x61/0xb0

Are there any modifications that can be made to the open source layer to avoid the call to wake_up_interruptible in this context? Can the call be deferred to a preemptible context, for example with RCU? Can NVIDIA change the binary component of the kernel driver to avoid calling nv_post_event from non-preemptible contexts?

I’ve repeated the same tests with a recent R450 release and see the same issue, albeit with a different set of mangled binary symbol names in the backtrace.
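To make the deferral question more concrete, here is a small userspace model of the pattern: the non-preemptible side only flips an atomic flag, and the blocking wake-up is issued later from a context that is allowed to sleep. This is not driver code. The struct and function names are hypothetical, and the counter merely stands in for the spot where wake_up_interruptible would be called. In the kernel, irq_work (init_irq_work, irq_work_queue) looks like one plausible vehicle for this, but whether its callback runs in a context that may take sleeping locks on this RT kernel is an assumption that would need checking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-file state; not nv_file_private_t. */
struct deferred_wake {
    atomic_bool pending;     /* set by the atomic side, cleared by the drain */
    int wakeups_issued;      /* counts how often the real wake-up ran */
};

/* Safe from a non-preemptible context: one atomic exchange, no locks.
 * Returns true if this call newly queued a wake-up, false if one was
 * already pending (repeated posts coalesce into a single wake-up). */
static bool deferred_wake_post(struct deferred_wake *dw)
{
    return !atomic_exchange(&dw->pending, true);
}

/* Runs later, in a context that is allowed to block. */
static void deferred_wake_drain(struct deferred_wake *dw)
{
    if (atomic_exchange(&dw->pending, false))
        dw->wakeups_issued++;  /* wake_up_interruptible() would go here */
}
```

The point of the model is that nothing on the post side can sleep, and several events arriving before the drain runs collapse into one wake-up, which should be fine for a waiter that re-checks event_pending. A real change would also have to make sure any pending deferred work is flushed before the per-file state is freed.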

Am I correct in my analysis of the bug? Any help is appreciated.