• Hardware Platform: GPU
• DeepStream Version: 7.1
• NVIDIA GPU Driver Version (valid for GPU only): 550, 560, 570
• Issue Type: bug
• How to reproduce the issue? GitHub - lumeohq/deepstream-encoder-segfault
Hello!
We see frequent segfaults when running multiple pipelines in multiple threads in the same process. Under some conditions, the program segfaults with roughly 80% probability within the first 5 seconds, so it is quite easy to reproduce. We tested this on 2 different machines with different drivers: the 535 driver is not affected, but 550, 560, and 570 are.
More details are in the README on GitHub. The MRE contains a script that runs a simple C application using the nvcr.io/nvidia/deepstream:7.1-triton-multiarch Docker image.
GDB backtrace:
Thread 32 "videotestsrc2:s" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x732e31000640 (LWP 222)]
0x0000732e71417941 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
We have Ubuntu 22.04 in production; my dev machine runs Mint 22.1 (similar to Ubuntu 24.04), and it crashes there too. The driver was installed from the Linux Mint driver manager: 570.144 (open).
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti Off | 00000000:0A:00.0 On | N/A |
| 0% 48C P8 28W / 310W | 1340MiB / 8192MiB | 8% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
I also tried installing the driver from the NVIDIA-Linux-x86_64-570.144.run file downloaded from NVIDIA, and the test still fails.
Have you tried running run.sh multiple times? There is some probability of failure; for instance, I ran it 5 times:
Running in normal mode...
Exit code: 139
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 139
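For reference, exit code 139 is the shell's encoding of a process killed by SIGSEGV (128 + signal number 11). A quick sketch, purely for illustration:

```python
import signal
import subprocess

# A child process killed by SIGSEGV is reported by the shell as 128 + 11 = 139.
proc = subprocess.run(["sh", "-c", "kill -SEGV $$"])

print(proc.returncode)            # Python reports a signal death as -signum: -11
print(128 + int(signal.SIGSEGV))  # 139, the value run.sh prints as "Exit code"
```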
You can also run
sudo apt install python3-tabulate
./table.py
... and leave it for half an hour. If the resulting table is all zeros, then it works well on your setup.
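If you prefer not to install tabulate, a minimal loop in the same spirit (the run.sh path is an assumption; adjust to your checkout) could look like:

```python
import subprocess

# Minimal sketch of a stress loop: run a command repeatedly and count how many
# runs die with SIGSEGV. 139 is the shell encoding (128 + 11); Python's
# subprocess reports the same death as -11, so accept either.
def count_segfaults(cmd, runs=5):
    segv = 0
    for _ in range(runs):
        rc = subprocess.run(cmd).returncode
        if rc in (139, -11):
            segv += 1
    return segv

# count_segfaults(["./run.sh"], runs=50) should stay at 0 on a good setup.
```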
3. I captured the stack. This seems to be a driver issue and cannot be solved in the DeepStream SDK.
(gdb) bt
#0 0x00007f736c218f61 in () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#1 0x00007f736c21909a in () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#2 0x00007f736c294a16 in () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#3 0x00007f736c29517d in () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#4 0x00007f7384f27ac3 in () at /usr/lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f7384fb8a04 in clone () at /usr/lib/x86_64-linux-gnu/libc.so.6
4. We are discussing this internally. Please use the recommended version 535 of the closed-source driver; this is also the driver we have tested for compatibility.
It’s great that you could reproduce it! I will try the driver from your post.
Yeah, we know that. But even if the pipeline fails because no encoders are available, the process shouldn't segfault in any case. It just turns out that when the pipeline errors out, the probability of a segfault is higher, and sometimes it segfaults even when the limit is not reached, as in your case above.
The reason we start more encoders than are available is that we found no way to determine how many encoder sessions are actually available at the moment using the NVML library or other tools. There may be other software running on the same machine, in different containers, that also uses encoders. So we sequentially start pipelines with 1..8 encoders and find out which ones succeed. For example, if the pipeline with 5 encoders succeeds but the one with 6 fails to start, we use 5 hardware encoders in the main pipeline and replace the other 3 with software encoders. If you can suggest an approach to find the count of remaining unused encoder sessions, it would help us mitigate the issue at least partially.
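To make the probing concrete, here is a sketch of the logic; `try_start` is a stand-in for actually launching a test pipeline with n hardware encoders, which our real code does through GStreamer:

```python
# Sketch of the probing described above: attempt pipelines with 1..max_needed
# hardware encoders and keep the largest count that starts successfully.
# try_start(n) is a hypothetical callback standing in for a real pipeline launch.
def probe_hw_encoder_limit(try_start, max_needed=8):
    usable = 0
    for n in range(1, max_needed + 1):
        if try_start(n):
            usable = n
        else:
            break
    return usable

# Example: suppose only 5 NVENC sessions happen to be free right now.
limit = probe_hw_encoder_limit(lambda n: n <= 5)
hw_encoders = limit        # 5 pipelines get hardware encoders
sw_encoders = 8 - limit    # the remaining 3 fall back to software encoders
```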
535 works well with regard to this issue, but here I can see that the driver must be 560 for RTX:
Use version: 535.183.06 for production deployments for Data Center GPUs
Please Note that for GeForce and RTX cards GPU driver must be 560.35.03 or higher.
R535.183.06(Data Center GPUs), R560.35.03(RTX GPUs)
So I wonder whether 535 is fully compatible with DS 7.1 on RTX cards. Maybe some feature is missing, and that's why the 560 driver is required? I'm a bit confused by the driver numbers :)
And another question: do you have an ETA on fixing the issue?