NVIDIA driver crashing during video-transcoding

Hello NVIDIA forums!

For the past few days, my NVIDIA driver has been crashing, seemingly at random, while transcoding video with NVIDIA-accelerated ffmpeg (including the scale_npp filter).
The development environment is: CentOS 8.0 / CUDA 11.6 / GPU driver 510.47.03 / ffmpeg 4.2.1.
The relevant part of the kernel log is below. I would really appreciate help from an NVIDIA expert in finding the root cause. If you need any more information, please feel free to let me know. :)


Warning: error at alloc_pipe_info, unprivileged user reached user_pages_soft limit
[20149.780717] alloc_pipe_info: 7352 callbacks suppressed
[20149.780719] pid=2673039 cmd=redis_pid.sh user_bufs=16519 too_many_pipe_buffers_soft
Warning: error at alloc_pipe_info, unprivileged user reached user_pages_soft limit
[20195.971634] ubp_svcd[2480573]: segfault at 1cd8010 ip 0000000001cd8010 sp 00007ffd00df2c78 error 15
[20195.971640] Code: Bad RIP value.
[20205.403829] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
[20205.403856] PGD 0 P4D 0
[20205.403867] Oops: 0000 [#1] SMP NOPTI
[20205.403880] CPU: 4 PID: 2480573 Comm: ubp_svcd Tainted: P OE --------- - - 4.18.0-147.5.2.5.h781.eulerosv2r10.x86_64 #1
[20205.403910] Hardware name: /0YWR7D, BIOS 2.12.2 07/09/2021
[20205.404226] RIP: 0010:_nv028124rm+0x4c1/0x5b0 [nvidia]
[20205.404243] Code: 05 33 01 00 84 c0 75 0a 41 80 bd 03 05 00 00 00 74 c7 49 8b 85 c0 1a 00 00 4c 89 ef 41 b8 01 00 00 00 b9 01 00 00 00 48 89 da <48> 8b 80 a0 01 00 00 48 8b b0 a0 01 00 00 e8 cc ca f2 ff 85 c0 41
[20205.404290] RSP: 0018:ffffaa4c06fb79d0 EFLAGS: 00010246
[20205.404305] RAX: 0000000000000000 RBX: ffff95ad5ff9c008 RCX: 0000000000000001
[20205.404323] RDX: ffff95ad5ff9c008 RSI: 0000000000000000 RDI: ffff95a06fb4c008
[20205.404342] RBP: ffff959d06b95d18 R08: 0000000000000001 R09: 0000000000000000
[20205.404361] R10: ffff95ad9f268000 R11: 0000000000029500 R12: ffff95a06fb4c008
[20205.404379] R13: ffff95a06fb4c008 R14: 0000000000000000 R15: 0000000000000000
[20205.404398] FS: 00007f9a3f0c67c0(0000) GS:ffff95aabf880000(0000) knlGS:0000000000000000
[20205.404419] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20205.404434] CR2: 00000000000001a0 CR3: 000000018a20a001 CR4: 00000000007606e0
[20205.404453] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20205.404471] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[20205.404490] PKRU: 55555554
[20205.404498] Call Trace:
[20205.404703] ? _nv028097rm+0x62/0x110 [nvidia]
[20205.404884] ? _nv002233rm+0x9/0x20 [nvidia]
[20205.405066] ? _nv003684rm+0x1b/0x70 [nvidia]
[20205.405249] ? _nv013893rm+0x784/0x7f0 [nvidia]
[20205.405391] ? _nv034507rm+0xac/0xe0 [nvidia]
[20205.405572] ? _nv035933rm+0xb0/0x140 [nvidia]
[20205.405752] ? _nv035932rm+0x30f/0x4f0 [nvidia]
[20205.405932] ? _nv035927rm+0x60/0x70 [nvidia]
[20205.406113] ? _nv035928rm+0x7b/0xb0 [nvidia]
[20205.406254] ? _nv034415rm+0x40/0xe0 [nvidia]
[20205.406415] ? _nv000627rm+0x68/0x80 [nvidia]
[20205.406575] ? rm_cleanup_file_private+0xea/0x170 [nvidia]
[20205.406595] ? free_one_page+0x1d7/0x480
[20205.406712] ? nvidia_close+0x14c/0x2d0 [nvidia]
[20205.406834] ? nvidia_frontend_close+0x2a/0x40 [nvidia]
[20205.406854] ? __fput+0xb7/0x230
[20205.406867] ? task_work_run+0x8a/0xb0
[20205.406881] ? do_exit+0x3b2/0xc20
[20205.406892] ? do_group_exit+0x33/0xb0
[20205.406904] ? get_signal+0x15e/0x850
[20205.406916] ? kmem_cache_alloc+0x38/0x1b0
[20205.406930] ? do_signal+0x36/0x610
[20205.406941] ? __send_signal+0x332/0x4d0
[20205.407436] ? exit_to_usermode_loop+0x76/0xe0
[20205.407898] ? prepare_exit_to_usermode+0x93/0xd0
[20205.408351] ? page_fault+0x8/0x30
[20205.408802] ? retint_user+0x8/0x8
[20205.409220] Modules linked in: iptable_filter ip_tables nvidia_uvm(OE) sysmonitor(O) kbox(O) kboxdriver(O) sunrpc vfat fat loop nvidia_drm(POE) nvidia_modeset(POE) intel_rapl nvidia(POE) skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp dell_smbios wmi_bmof dell_wmi_descriptor kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf intel_cstate mgag200 intel_uncore drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr sg mei_me ttm mei lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler drm ksecurec(O) ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci ixgbe igb mdio i2c_algo_bit megaraid_sas(O) libata dca
[20205.412047] kernel fault(0x1) notification starting on CPU 4


Best Regards,

A NULL pointer dereference occurs here, and I would like to know why. Is some protection needed on my side, for example a try/catch?

Hi,
Can you share the exact command line and any specific input files needed, along with detailed instructions to help reproduce this issue?
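
If it helps, version details can be gathered alongside the command line with something like the rough sketch below. It is only an illustration: the helper name collect and the choice of commands are assumptions, not an official NVIDIA tool, and it only calls nvidia-smi, ffmpeg and uname if they are found on the PATH.

```python
import shutil
import subprocess

# Assumed set of commands worth attaching to a report; adjust as needed.
COMMANDS = {
    "driver": ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    "ffmpeg": ["ffmpeg", "-hide_banner", "-version"],
    "kernel": ["uname", "-r"],
}


def collect(commands=COMMANDS, timeout=30):
    """Run each command and return {label: output or a short failure note}."""
    report = {}
    for label, argv in commands.items():
        # Skip tools that are not installed instead of raising.
        if shutil.which(argv[0]) is None:
            report[label] = "<not found: %s>" % argv[0]
            continue
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            # Some tools print to stderr only; fall back to it if stdout is empty.
            report[label] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = "<failed: %s>" % exc
    return report


if __name__ == "__main__":
    for label, text in collect().items():
        print("== %s ==" % label)
        print(text)
```

The printed output can be pasted into the reply together with the full ffmpeg command that triggers the crash.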

Thanks.