Segfault in nvv4l2h264enc (minimal reproducible example included)

• Hardware Platform (GPU)
• DeepStream Version 7.1
• NVIDIA GPU Driver Version (valid for GPU only) 550, 560, 570
• Issue Type (bugs)
• How to reproduce the issue ? GitHub - lumeohq/deepstream-encoder-segfault

Hello!

We see frequent segfaults when running multiple pipelines in multiple threads in the same process. Under some conditions, the program segfaults with roughly an 80% chance within the first 5 seconds, so it is quite easy to reproduce. We tested this on 2 different machines with different drivers: the 535 driver is not affected, but 550, 560, and 570 are.

More details are in the README on GitHub. The MRE contains a script that runs a simple C application using the nvcr.io/nvidia/deepstream:7.1-triton-multiarch Docker image.

GDB backtrace:

Thread 32 "videotestsrc2:s" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x732e31000640 (LWP 222)]
0x0000732e71417941 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1

I used the 570 driver on a 3070 Ti to test the case you provided, and it exits normally without any crash.

wget https://us.download.nvidia.com/tesla/570.133.20/nvidia-driver-local-repo-ubuntu2204-570.133.20_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-570.133.20_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2204-570.133.20/nvidia-driver-local-6AA56764-keyring.gpg /usr/share/keyrings/

sudo apt update
sudo apt install cuda-drivers

However, the 3070 Ti has a limit on the number of concurrent encoder sessions. You can refer to this table.

Are you testing on Ubuntu 22.04? We have only tested DS-7.1 on Ubuntu 22.04.

We have Ubuntu 22.04 in prod, and my dev machine runs Mint 22.1 (similar to Ubuntu 24.04); it crashes there too. The driver is installed from the Linux Mint Driver Manager: 570.144 (open).

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144                Driver Version: 570.144        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti     Off |   00000000:0A:00.0  On |                  N/A |
|  0%   48C    P8             28W /  310W |    1340MiB /   8192MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I also tried installing the driver using the NVIDIA-Linux-x86_64-570.144.run file downloaded from NVIDIA, and the test still fails.

Have you tried running run.sh multiple times? There’s some probability of failure; for instance, I ran it 5 times:

Running in normal mode...
Exit code: 139
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 0
...
Running in normal mode...
Exit code: 139

You can also run

sudo apt install python3-tabulate
./table.py

… and leave it running for half an hour. If the resulting table is all zeros, then it works well on your setup.
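For reference, a repeated-run tally like the one `table.py` produces can be sketched without the `tabulate` dependency; the `./run.sh` path is the script from the MRE, and exit code 139 indicates a segfault:

```python
import subprocess
from collections import Counter

def tally_exit_codes(cmd, runs):
    """Run `cmd` `runs` times and count the exit codes.

    Exit code 139 means the child died from SIGSEGV (128 + 11)."""
    codes = Counter()
    for _ in range(runs):
        codes[subprocess.run(cmd).returncode] += 1
    return dict(codes)

# e.g. tally_exit_codes(["./run.sh"], 5) might return {0: 3, 139: 2}
```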

  1. For the 3070 Ti, a maximum of 8 encoding sessions is supported at the same time, so I modified the following parameters:
thread_range = range(1, 3)
encoders_range = range(1, 5)

2. After running for 30 rounds, one crash occurred.

+-------------+---------+---------+---------+-----------------+
|   Thr \ Enc | 1       | 2       | 3       | 4               |
+=============+=========+=========+=========+=================+
|           1 | {0: 31} | {0: 31} | {0: 31} | {0: 31}         |
+-------------+---------+---------+---------+-----------------+
|           2 | {0: 31} | {0: 30} | {0: 30} | {0: 29, 139: 1} |
+-------------+---------+---------+---------+-----------------+

3. I captured the stack. This seems to be a driver issue and cannot be solved in the DeepStream SDK:

(gdb) bt
#0  0x00007f736c218f61 in  () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#1  0x00007f736c21909a in  () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#2  0x00007f736c294a16 in  () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#3  0x00007f736c29517d in  () at /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#4  0x00007f7384f27ac3 in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f7384fb8a04 in clone () at /usr/lib/x86_64-linux-gnu/libc.so.6

4. We are discussing this internally. Please use the recommended 535 version of the closed-source driver for now.

This is also the driver we have tested for compatibility

It’s great that you could reproduce it! I will try the driver from your post.

Yeah, we know that. But even if the pipeline fails due to unavailable encoders, the process shouldn’t segfault under any circumstances. It just turns out that when the pipeline errors out, the probability of a segfault is higher. And sometimes it segfaults even when the limit is not reached, as in your case above.

The reason we start more encoders than are available is that we found no way to determine, using NVML or other tools, how many encoder sessions are actually available at the moment. Other software running on the same machine in different containers may also be using encoders. So we sequentially start pipelines with 1..8 encoders and find out which ones succeed. For example, if the pipeline with 5 encoders succeeds but the one with 6 fails to start, we use 5 hardware encoders in the main pipeline and replace the remaining 3 with software encoders. If you can suggest an approach to find the count of remaining unused encoder sessions, it would help us mitigate the issue partially.
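The sequential probing described above boils down to a simple loop. A minimal sketch, where `try_hw_encoders` is a hypothetical callback standing in for launching a test pipeline with n hardware encoder instances:

```python
def max_working_encoders(try_hw_encoders, limit=8):
    """Largest n in 1..limit for which try_hw_encoders(n) succeeds (0 if none).

    `try_hw_encoders` is a hypothetical callback that launches a test
    pipeline with n hardware encoders and reports whether it started."""
    best = 0
    for n in range(1, limit + 1):
        if try_hw_encoders(n):
            best = n
        else:
            break  # sessions exhausted; no point probing higher counts
    return best

def split_hw_sw(total_streams, try_hw_encoders, limit=8):
    """Split streams into (hardware, software) encoder counts."""
    hw = min(max_working_encoders(try_hw_encoders, limit), total_streams)
    return hw, total_streams - hw
```

With 8 streams and only 5 available sessions, this yields the 5 HW + 3 SW split mentioned above.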

You can try to create encoder instances using nvEncOpenEncodeSessionEx to test the number of available encoder sessions.
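A pure-GStreamer approximation of that probe (not the NVENC C API itself) is to try bringing a tiny pipeline with n encoder instances to PLAYING and see whether it succeeds. A rough sketch, assuming DeepStream's nvvideoconvert/nvv4l2h264enc elements and a GPU are present; the pipeline description is illustrative, not from the MRE:

```python
def probe_pipeline_desc(n):
    """gst-launch-style description with n hardware encoder branches."""
    return " ".join(
        f"videotestsrc num-buffers=30 ! nvvideoconvert ! "
        f"nvv4l2h264enc ! fakesink name=sink{i}"
        for i in range(n)
    )

def can_open_sessions(n):
    """Try to reach PLAYING with n nvv4l2h264enc instances (needs a GPU)."""
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst
    Gst.init(None)
    pipeline = Gst.parse_launch(probe_pipeline_desc(n))
    ret = pipeline.set_state(Gst.State.PLAYING)
    if ret == Gst.StateChangeReturn.ASYNC:
        # wait up to 2 s for the async state change to resolve
        ret, _, _ = pipeline.get_state(2 * Gst.SECOND)
    pipeline.set_state(Gst.State.NULL)
    return ret != Gst.StateChangeReturn.FAILURE
```

Note that nvv4l2h264enc opens its NVENC session underneath, so a PLAYING failure here is only an indirect signal that sessions are exhausted.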


535 works well with regard to this issue, but here I can see that the driver must be 560 for RTX:

Use version: 535.183.06 for production deployments for Data Center GPUs
Please Note that for GeForce and RTX cards GPU driver must be 560.35.03 or higher.

Same in the link you posted:

R535.183.06(Data Center GPUs), R560.35.03(RTX GPUs)

So I wonder whether 535 is fully compatible with DS 7.1 on RTX cards. Maybe there’s some feature missing, and that’s why the 560 driver is required? I’m a bit confused by the driver numbers :)

And another question: do you have an ETA on fixing the issue?