Regression: use-after-free inside NVENC in Linux drivers after 520.56.06

Hi,

The neighboring topic "Linux after 520.56.06 drivers randomly segfault nvenc inside nvcuvid thread" has been ignored for a long time, so I am trying here. In NVIDIA drivers after 520.56.06, NVENC contains a use-after-free bug. I checked drivers 525.125.06, 535.54.03, 535.86.05, 535.113.01 and 545.23.06, with CUDA 12.2 on Ubuntu 22.04 and GTX 1070 hardware. Attached is a patch for Video_Codec_SDK_12.1.14 that adds a new sample, Samples/AppEncode/AppEncCudaBug, which creates and releases 300 encode sessions. Valgrind detects invalid reads, and the code sometimes segfaults.

To reproduce:

src $ patch -p1 < Video_Codec_SDK_12.1.14_bug.patch.txt
src $ cd src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug $ mkdir build
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug $ cd build
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug/build $ cmake .. 
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug/build $ make 
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug/build $ ./AppEncCudaBug
src/Video_Codec_SDK_12.1.14/Samples/AppEncode/AppEncCudaBug/build $ valgrind --trace-children=yes --leak-check=full --log-file=valgrind.txt ./AppEncCudaBug

The invalid-read part of valgrind.txt:

==450484== Thread 10:
==450484== Invalid read of size 4
==450484==    at 0x6A16F01: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A17039: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A863B5: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A86B1C: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x787431B: start_thread (pthread_create.c:444)
==450484==    by 0x78F76AF: clone (clone.S:100)
==450484==  Address 0x26606c8c is 515,932 bytes inside an unallocated block of size 1,826,752 in arena "client"
==450484== 
==450484== Invalid read of size 4
==450484==    at 0x6A16F07: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A17039: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A863B5: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x6A86B1C: ??? (in /usr/lib64/libnvcuvid.so.545.23.06)
==450484==    by 0x787431B: start_thread (pthread_create.c:444)
==450484==    by 0x78F76AF: clone (clone.S:100)
==450484==  Address 0x26606c78 is 515,912 bytes inside an unallocated block of size 1,826,752 in arena "client"

The segfault (AddressSanitizer output):

AddressSanitizer:DEADLYSIGNAL
=================================================================
==444460==ERROR: AddressSanitizer: SEGV on unknown address 0x7f3bf400a40c (pc 0x7f3c02c16f01 bp 0x7f3c0378b5a8 sp 0x7f3bd3e386f0 T-1)
==444460==The signal is caused by a READ memory access.
    #0 0x7f3c02c16f01  (/usr/lib64/libnvcuvid.so.1+0x16f01) (BuildId: e45304d759eabb77c567f3332a917d9eb61ab913)
    #1 0x7f3c02c17039  (/usr/lib64/libnvcuvid.so.1+0x17039) (BuildId: e45304d759eabb77c567f3332a917d9eb61ab913)
    #2 0x7f3c02c863b5  (/usr/lib64/libnvcuvid.so.1+0x863b5) (BuildId: e45304d759eabb77c567f3332a917d9eb61ab913)
    #3 0x7f3c02c86b1c  (/usr/lib64/libnvcuvid.so.1+0x86b1c) (BuildId: e45304d759eabb77c567f3332a917d9eb61ab913)
    #4 0x7f3c026b231b in start_thread /var/tmp/portage/sys-libs/glibc-2.37-r7/work/glibc-2.37/nptl/pthread_create.c:444
    #5 0x7f3c0273579b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/usr/lib64/libnvcuvid.so.1+0x16f01) (BuildId: e45304d759eabb77c567f3332a917d9eb61ab913)
==444460==ABORTING

Video_Codec_SDK_12.1.14_bug.patch.txt (10.8 KB)

@khizbulin
I have filed bug 4340965 internally for tracking purposes.
The team will review it and get back to you.

@khizbulin

I tried to reproduce the issue locally with the shared patch and steps, using the latest released drivers from the 545 branch, but could not see any SEGV errors.
Could you please double-check the patch and steps and share a reliable repro?

root@test-Alienware-Aurora-R12:~/Video_Codec_SDK_12.1.14.bug/Samples/build/AppEncode/AppEncCudaBug# cat valgrind.txt
==30650== Memcheck, a memory error detector
==30650== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30650== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==30650== Command: ./AppEncCudaBug
==30650== Parent PID: 4072
==30650==
==30650==
==30650== HEAP SUMMARY:
==30650== in use at exit: 852 bytes in 24 blocks
==30650== total heap usage: 24 allocs, 0 frees, 852 bytes allocated
==30650==
==30650== LEAK SUMMARY:
==30650== definitely lost: 0 bytes in 0 blocks
==30650== indirectly lost: 0 bytes in 0 blocks
==30650== possibly lost: 0 bytes in 0 blocks
==30650== still reachable: 852 bytes in 24 blocks
==30650== suppressed: 0 bytes in 0 blocks
==30650== Reachable blocks (those to which a pointer was found) are not shown.
==30650== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==30650==
==30650== For lists of detected and suppressed errors, rerun with: -s
==30650== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Hi,

"ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)" is very strange output. NVENC always produces thousands of Valgrind errors. A SIGSEGV in AppEncCudaBug happens very rarely, because the invalid read is very small and Linux does not have time to reclaim the freed memory and reuse it for something else. I attach my full valgrind.log.
typescript1.txt (293.5 KB)

hizel@core-hizel ~/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build $ grep -A 10 'Invalid read of size 4' typescript1.txt 
==3691051== Invalid read of size 4
==3691051==    at 0x6A16F01: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A17039: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A863B5: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A86B1C: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x787431B: start_thread (pthread_create.c:444)
==3691051==    by 0x78F76AF: clone (clone.S:100)
==3691051==  Address 0xea73cdc is 6,780 bytes inside a block of size 16,536 free'd
==3691051==    at 0x4843A56: free (vg_replace_malloc.c:985)
==3691051==    by 0x6A3DF17: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A2433A: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
--
==3691051== Invalid read of size 4
==3691051==    at 0x6A16F07: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A17039: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A863B5: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A86B1C: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x787431B: start_thread (pthread_create.c:444)
==3691051==    by 0x78F76AF: clone (clone.S:100)
==3691051==  Address 0xea73cc8 is 6,760 bytes inside a block of size 16,536 free'd
==3691051==    at 0x4843A56: free (vg_replace_malloc.c:985)
==3691051==    by 0x6A3DF17: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
==3691051==    by 0x6A2433A: ??? (in /usr/lib64/libnvcuvid.so.545.29.02)
$ nvidia-smi 
Tue Nov 14 15:51:41 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070        Off | 00000000:01:00.0  On |                  N/A |
| 19%   46C    P5              15W / 151W |    645MiB /  8192MiB |     38%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Hi,

It looks like the previous reproduction code was too complicated. I wrote a simpler one:

// Include paths as used by the SDK's AppEncode samples.
#include <unistd.h>   // usleep
#include <random>
#include <cuda.h>
#include "NvEncoder/NvEncoderCuda.h"
#include "../Utils/Logger.h"
#include "../Utils/NvCodecUtils.h"          // ck()
#include "../Utils/NvEncoderCLIOptions.h"   // NvEncoderInitParam

simplelogger::Logger *logger = simplelogger::LoggerFactory::CreateConsoleLogger();

int main(int argc, char **argv)
{
        std::mt19937_64 eng{std::random_device{}()};  // not used below
        ck(cuInit(0));
        CUdevice cuDevice = 0;
        CUcontext cuContext = NULL;
        ck(cuDeviceGet(&cuDevice, 0));
        ck(cuCtxCreate(&cuContext, 0, cuDevice));

        NV_ENC_BUFFER_FORMAT eFormat = NV_ENC_BUFFER_FORMAT_NV12;

        // First session: its parameters are reused for the other two.
        auto enc1 = new NvEncoderCuda(cuContext, 240, 160, eFormat);
        NvEncoderInitParam encodeCLIOptions("-codec h264 -gop 54 -bitrate 82000 -profile main -preset p2 -rc CBR -bf 2 -fps 24");

        NV_ENC_INITIALIZE_PARAMS initializeParams = { NV_ENC_INITIALIZE_PARAMS_VER };
        NV_ENC_CONFIG encodeConfig = { NV_ENC_CONFIG_VER };

        initializeParams.encodeConfig = &encodeConfig;
        enc1->CreateDefaultEncoderParams(&initializeParams, encodeCLIOptions.GetEncodeGUID(), encodeCLIOptions.GetPresetGUID(), encodeCLIOptions.GetTuningInfo());
        encodeCLIOptions.SetInitParams(&initializeParams, eFormat);
        enc1->CreateEncoder(&initializeParams);

        // Second and third sessions on the same CUDA context.
        auto enc2 = new NvEncoderCuda(cuContext, 240, 160, eFormat);
        enc2->CreateEncoder(&initializeParams);

        auto enc3 = new NvEncoderCuda(cuContext, 240, 160, eFormat);
        enc3->CreateEncoder(&initializeParams);

        // Destroy the first-created session while the other two are still alive.
        delete enc1;
        usleep(1000000);  // wait 1 s before destroying the remaining sessions
        delete enc2;
        delete enc3;
        return 0;
}

Under Valgrind, this code produces the following error:

==62877== Invalid read of size 4
==62877==    at 0x6A16F01: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A17039: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A863B5: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A86B1C: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x68C5398: start_thread (pthread_create.c:444)
==62877==    by 0x6938BAF: clone (clone.S:100)
==62877==  Address 0x26206c8c is 27,660 bytes inside a block of size 1,338,480 free'd
==62877==    at 0x4841A56: free (vg_replace_malloc.c:985)
==62877==    by 0x6A24344: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A9F9E4: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A9FBA8: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x660364F: ??? (in /usr/lib64/libnvidia-encode.so.545.29.06)
==62877==    by 0x660588A: ??? (in /usr/lib64/libnvidia-encode.so.545.29.06)
==62877==    by 0x661F91D: ??? (in /usr/lib64/libnvidia-encode.so.545.29.06)
==62877==    by 0x12E497: NvEncoder::DestroyHWEncoder() (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x12B931: NvEncoder::~NvEncoder() (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x13A8A1: NvEncoderCuda::~NvEncoderCuda() (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x13A8BD: NvEncoderCuda::~NvEncoderCuda() (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x1164AB: main (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==  Block was alloc'd at
==62877==    at 0x483E787: malloc (vg_replace_malloc.c:442)
==62877==    by 0x6A3C69C: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6A99EF1: ??? (in /usr/lib64/libnvcuvid.so.545.29.06)
==62877==    by 0x6605943: ??? (in /usr/lib64/libnvidia-encode.so.545.29.06)
==62877==    by 0x661C8EA: ??? (in /usr/lib64/libnvidia-encode.so.545.29.06)
==62877==    by 0x12AF0F: NvEncoder::NvEncoder(_NV_ENC_DEVICE_TYPE, void*, unsigned int, unsigned int, _NV_ENC_BUFFER_FORMAT, unsigned int, bool, bool, bool, bool) (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x13A4A0: NvEncoderCuda::NvEncoderCuda(CUctx_st*, unsigned int, unsigned int, _NV_ENC_BUFFER_FORMAT, unsigned int, bool, bool, bool) (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)
==62877==    by 0x116285: main (in /home/hizel/src/Video_Codec_SDK_12.1.14.bug/Samples/AppEncode/AppEncCudaBug/build/AppEncCudaBug)

Also, if the enc1 encoder is deleted last, there is no error in Valgrind and no potential segfault. In my multithreaded program I would like to destroy encoders in any order, and this worked with drivers before version 520.56.06.
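
Until the driver is fixed, one possible workaround is to defer destruction of the first-created encoder until all the others are gone. The following is only a minimal sketch under that assumption (taken from the observation above, not confirmed by NVIDIA). FirstLastPool and FakeEncoder are hypothetical names; FakeEncoder stands in for NvEncoderCuda so the sketch compiles without the SDK.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for NvEncoderCuda (assumption: the real class would be used instead).
// It appends its name to a shared log when it is destroyed.
struct FakeEncoder {
    std::string name;
    std::vector<std::string>* log;
    FakeEncoder(std::string n, std::vector<std::string>* l)
        : name(std::move(n)), log(l) {}
    ~FakeEncoder() { log->push_back(name); }
};

// Hypothetical helper: owns encoder sessions and keeps the first-created one
// alive until every other session has been destroyed.
template <typename Encoder>
class FirstLastPool {
public:
    // Takes ownership of a session and returns a handle for release().
    std::size_t add(std::unique_ptr<Encoder> enc) {
        slots_.push_back(std::move(enc));
        return slots_.size() - 1;
    }

    // Requests destruction of one session. Handle 0 (the first-created
    // session) is only marked and destroyed once all others are gone.
    void release(std::size_t handle) {
        if (handle == 0) {
            first_released_ = true;
        } else {
            slots_.at(handle).reset();
        }
        destroy_first_if_possible();
    }

    // Destroys any remaining sessions, the first-created one last.
    ~FirstLastPool() {
        for (std::size_t i = slots_.size(); i-- > 1;) slots_[i].reset();
        if (!slots_.empty()) slots_[0].reset();
    }

private:
    void destroy_first_if_possible() {
        if (!first_released_ || slots_.empty()) return;
        for (std::size_t i = 1; i < slots_.size(); ++i)
            if (slots_[i]) return;  // another session is still alive
        slots_[0].reset();
    }

    std::vector<std::unique_ptr<Encoder>> slots_;
    bool first_released_ = false;
};

// Usage: enc1 is released first, as in the reproduction code above,
// but its destructor runs only after enc3 and enc2 are destroyed.
std::vector<std::string> demo_destruction_order() {
    std::vector<std::string> log;
    {
        FirstLastPool<FakeEncoder> pool;
        std::size_t h1 = pool.add(std::make_unique<FakeEncoder>("enc1", &log));
        std::size_t h2 = pool.add(std::make_unique<FakeEncoder>("enc2", &log));
        std::size_t h3 = pool.add(std::make_unique<FakeEncoder>("enc3", &log));
        pool.release(h1);  // deferred
        pool.release(h3);
        pool.release(h2);  // last other session gone, enc1 destroyed now
    }
    return log;
}
```

The sketch is not thread-safe; in a multithreaded program add() and release() would need a mutex. It also does not handle sessions added after the first one has been destroyed.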