DriveWorks thread safety

What is the thread safety status of DriveWorks?
How should I design a multi-threaded program that uses DriveWorks, NvMedia, CUDA and so on?

A case:

I have 4 image streamers of the same type (NvMedia -> CUDA), one for each CSI sibling. One thread produces camera images (it calls dwImageStreamer_postNvMedia()), and 4 other threads consume them (one thread per streamer, each calling dwImageStreamer_receiveCUDA()). I get a SIGSEGV when receiving an image in dwImageStreamer_receiveCUDA():

0x0000007fb3b08308 in ?? () from /usr/lib/libcuda.so.1
(gdb) backtrace
#0  0x0000007fb3b08308 in ?? () from /usr/lib/libcuda.so.1
#1  0x0000007fb3b085cc in ?? () from /usr/lib/libcuda.so.1
#2  0x0000007fb3a0d430 in ?? () from /usr/lib/libcuda.so.1
#3  0x0000007fb3a79a54 in ?? () from /usr/lib/libcuda.so.1
#4  0x0000007fb3a129e0 in ?? () from /usr/lib/libcuda.so.1
#5  0x0000007fb3a12e2c in ?? () from /usr/lib/libcuda.so.1
#6  0x0000007fb3a16dc4 in ?? () from /usr/lib/libcuda.so.1
#7  0x0000007fb39f7e18 in ?? () from /usr/lib/libcuda.so.1
#8  0x0000007fb39f8748 in ?? () from /usr/lib/libcuda.so.1
#9  0x0000007fb3a72120 in cuEGLStreamConsumerAcquireFrame () from /usr/lib/libcuda.so.1
#10 0x0000007fb4d21a08 in ?? () from /usr/local/driveworks-0.3/targets/aarch64-linux/lib/libdriveworks.so.0
#11 0x0000007fb4d00f58 in ?? () from /usr/local/driveworks-0.3/targets/aarch64-linux/lib/libdriveworks.so.0
#12 0x0000007fb4d13cd8 in ?? () from /usr/local/driveworks-0.3/targets/aarch64-linux/lib/libdriveworks.so.0
#13 0x0000007fb4cfaf20 in dwImageStreamer_receiveCUDA () from /usr/local/driveworks-0.3/targets/aarch64-linux/lib/libdriveworks.so.0

But there is no such failure when I have only 1 streamer, 1 producer thread and 1 consumer thread. It seems that all image streamers share a context that is not protected against concurrent access.
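
For readers trying to picture the topology described above, here is a minimal sketch of one producer thread feeding four consumer threads through four independent channels. It assumes nothing about DriveWorks internals: `Channel` and `runFanOut` are hypothetical names, and `Channel` is only a mutex-protected queue standing in for an image streamer, so this shows the intended threading pattern rather than a reproduction of the crash.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for one image streamer: a single-producer,
// single-consumer queue. This is NOT the DriveWorks implementation.
class Channel {
public:
    void post(int frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(frame);
        }
        ready_.notify_one();
    }

    int receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        int frame = queue_.front();
        queue_.pop();
        return frame;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> queue_;
};

constexpr std::size_t kCameras = 4;

// One producer posts every frame to all channels; each consumer thread
// drains exactly one channel. Returns the frame count seen per consumer.
std::array<int, kCameras> runFanOut(int frameCount) {
    assert(frameCount >= 0);
    std::array<Channel, kCameras> channels;
    std::array<int, kCameras> received{};

    std::vector<std::thread> consumers;
    for (std::size_t i = 0; i < kCameras; ++i) {
        consumers.emplace_back([&channels, &received, i, frameCount] {
            for (int n = 0; n < frameCount; ++n) {
                channels[i].receive();
                ++received[i];  // each thread touches only its own slot
            }
        });
    }

    std::thread producer([&channels, frameCount] {
        for (int n = 0; n < frameCount; ++n)
            for (auto& channel : channels) channel.post(n);
    });

    producer.join();
    for (auto& consumer : consumers) consumer.join();
    return received;
}
```

Because each channel owns its own mutex and queue, no state is shared between channels in this sketch. That is the property the post above expects from independent image streamers.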

Same problem here. I have multiple NvMedia -> CUDA streamers (one for each camera), and when I post to them from separate threads, I get a segmentation fault. Any progress on this issue?

Dear NicoSchmidt,
Each image streamer has just one producer and one consumer. Can you please share more details about your use case?

Dear SivaRamaKrishna,

Yes, I use only one producer and one consumer per streamer, but I use multiple streamers of the same type (NvMedia -> CUDA) in parallel. They are all initialized with the same dwContext, though.

Here are the basic steps I do:

  • Initialize driveworks context (sdk)
  • Initialize sensor abstraction layer (sal)
  • Create gmsl camera sensor (sensor)
  • for each camera of this sensor create one Streamer: ``` dwImageStreamer_initialize(&streamerNvMedia2CUDA_YUV[i], &imageProperties, DW_IMAGE_CUDA, sdk); ```
  • Then for each camera create a thread and do in parallel:
    • get frame handle ``` dwSensorCamera_readFrame(&frameHandle[i], i, 1000000, sensor); ```
    • get NvMedia image ``` dwSensorCamera_getImageNvMedia(&NvMediaFrameYUVPtr[i], DW_CAMERA_PROCESSED_IMAGE, frameHandle[i]) ```
    • stream image to CUDA ``` dwImageStreamer_postNvMedia(NvMediaFrameYUVPtr[i], streamerNvMedia2CUDA_YUV[i]); ```
    • segmentation fault :(

Dear SivaRamaKrishna,

Since we are talking about a thread safety issue, could you please have a look at this potentially related issue (opened November 11th, 2017):
https://devtalk.nvidia.com/default/topic/1026673/driveworks-thread-safety/#5230241

Thanks,
Xavier

Dear NicoSchmidt,
Can you please file a bug with all the supporting material needed to reproduce it, using nvonline or My Bugs in the forum? We will look into it.

Hi SivaRamaKrishna,

I just wanted to say that we’ve been having precisely the same problem NicoSchmidt describes. Placing a global lock around the use of the imagestreamers avoids the problem, but also nullifies the benefit of parallelism. To be clear we’re using:

  • One thread per camera
  • One imagestreamer per camera/thread
  • Each thread produces then consumes its own images to the streamer

As far as we can tell, there shouldn’t be any need for any locks as everything is siloed into separate threads. However, somewhere in libcuda there is interaction between these separate imagestreamers that causes a segfault.

    Can you please:

  • Confirm that separate image streamers created with dwImageStreamer_initialize should be safe when used within independent threads
  • If so, fix this problem for the next release
  • If not, document the threading behaviour of imagestreamers
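
The trade-off of the global-lock workaround mentioned in this post can be sketched as follows. This is a minimal illustration with hypothetical names (`ConcurrencyProbe`, `peakConcurrency`); `streamOneFrame()` merely sleeps in place of the real post/receive/convert/return sequence and records how many threads are inside it at once.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-camera streaming work. It only tracks
// how many threads are inside streamOneFrame() at the same time.
struct ConcurrencyProbe {
    std::atomic<int> active{0};
    std::atomic<int> peak{0};

    void streamOneFrame() {
        int now = ++active;
        int seen = peak.load();
        // Raise the recorded peak if this thread observed a higher value.
        while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        --active;
    }
};

// Runs one thread per camera and returns the highest number of threads
// that were inside streamOneFrame() simultaneously.
int peakConcurrency(int cameras, int frames, bool useGlobalLock) {
    assert(cameras > 0 && frames > 0);
    ConcurrencyProbe probe;
    std::mutex globalStreamerLock;  // the workaround: one lock for all streamers

    std::vector<std::thread> threads;
    for (int c = 0; c < cameras; ++c) {
        threads.emplace_back([&] {
            for (int f = 0; f < frames; ++f) {
                if (useGlobalLock) {
                    std::lock_guard<std::mutex> lock(globalStreamerLock);
                    probe.streamOneFrame();
                } else {
                    probe.streamOneFrame();
                }
            }
        });
    }
    for (auto& thread : threads) thread.join();
    return probe.peak.load();
}
```

With `useGlobalLock` set to `true` the peak is always 1: the cameras are processed strictly one after another, which is why the workaround removes the crash but also the parallelism.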

    Dear johnmark,
    Can you please file a bug on this topic with sample code (preferably a DriveWorks sample) so that we can reproduce it on our side? I will look into it.
    Please log in to https://developer.nvidia.com/drive with your credentials, then go to MyAccount->MyBugs->Submit a new bug to file the bug.

    Dear SivaRamaKrishna, Dear all,

    It seems quite clear that there is a thread safety issue (or strong usage limitation) with image streamers.
    This is (at least) the third similar issue reported here.

    It would be really appreciated if someone from NVIDIA could test image streamers in a multi-threaded environment.

    Also, moving to “MyBugs” means that we will lose the ability to discuss this issue in a shared thread…

    Thanks and best regards,
    Xavier

    Hi everyone,

    I agree with XavierR, moving these discussions into a private “MyBugs” forum makes it much harder for developers to learn from the experiences of others.

    Please can you, at the very least, confirm that separate image streamers created with dwImageStreamer_initialize should be safe when used exclusively within independent threads?

    Thank you

    Dear XavierR & johnmark,

    This is an application choice and a potential performance issue.

    • We will review the documentation to make the thread safety policies clearer.
    • We are modifying modules so that the application owns the output data of any module; this way the application has more control over how the data is shared.

    We need more details on how you are running this. For example, if you have multiple GPUs (such as an iGPU and a dGPU) and there is a mismatch in GPU selection between allocation and usage of CUDA memory, you can get a crash.

    Hi,

    I am running on one of the Tegra X2/Pascal GPUs from a DRIVE AutoChauffeur PX2. The relevant code snippet that crashes is below (I’ve purged extraneous parts, including status checking after each call, to make things clearer):

    struct Camera {
        uint32_t camera_id;
        dwSensorHandle_t sensor; // Shared between threads that use cameras on the same CSI port
        dwTime_t timeout_us = 300000;
        dwImageStreamerHandle_t streamer_nvm2cuda;
        dwImageFormatConverterHandle_t converter_yuv2rgb;
        dwImageCUDA frame_cuda_rgb;
    };
    
    
    void captureCamera(Camera* camera) {
        dwCameraFrameHandle_t frame_handle;
        dwImageNvMedia* frame_nvm_yuv = nullptr;
        dwImageCUDA* frame_cuda_yuv = nullptr;

        // Grab the next frame for this camera and get its processed NvMedia image.
        dwSensorCamera_readFrame(&frame_handle, camera->camera_id, camera->timeout_us, camera->sensor);
        dwSensorCamera_getImageNvMedia(&frame_nvm_yuv, DW_CAMERA_PROCESSED_IMAGE, frame_handle);

        // Producer side: hand the NvMedia image to this camera's streamer.
        dwImageStreamer_postNvMedia(frame_nvm_yuv, camera->streamer_nvm2cuda);

        // Consumer side: receive the same frame as a CUDA image.
        dwImageStreamer_receiveCUDA(&frame_cuda_yuv, camera->timeout_us, camera->streamer_nvm2cuda);

        // Convert YUV to RGB into this camera's own output image.
        dwImageFormatConverter_copyConvertCUDA(&camera->frame_cuda_rgb,
                                               frame_cuda_yuv,
                                               camera->converter_yuv2rgb,
                                               0);

        // Hand everything back in reverse order of acquisition.
        dwImageStreamer_returnReceivedCUDA(frame_cuda_yuv, camera->streamer_nvm2cuda);
        dwImageStreamer_waitPostedNvMedia(&frame_nvm_yuv, 1000, camera->streamer_nvm2cuda);
        dwSensorCamera_returnFrame(&frame_handle);
    }
    

    This function is called in parallel by each thread, one for each camera.

    All elements of the Camera struct are local to their own thread and not shared between threads, with the exception of the dwSensorHandle_t when two cameras use the same CSI port. However, that is not the source of the crash as it occurs even with two cameras on different ports.

    Thank you for reviewing the documentation. However, I am still unclear on whether I am correct that independent dwImageStreamers should be thread safe in independent threads. If they should be, this is more than a performance/choice issue; it is a bug.

    Best,
    John-Mark

    Dear johnmark,
    Can you check the following things?

    • Use only one camera for each group
    • Keep dwSensorCamera_readFrame in a critical section
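
One possible reading of the second suggestion is to serialize only the frame read with a shared mutex and leave the rest of the per-camera work unlocked. The sketch below illustrates that idea under stated assumptions: `RegionProbe`, `CaptureStats` and `runCapture` are hypothetical names, and the sleeps stand in for the DriveWorks calls, so it demonstrates the locking scope only and does not show that this avoids the crash.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Tracks how many threads are inside a region at once (illustration only).
struct RegionProbe {
    std::atomic<int> active{0};
    std::atomic<int> peak{0};

    void enter() {
        int now = ++active;
        int seen = peak.load();
        while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
    }
    void leave() { --active; }
};

struct CaptureStats {
    int peakInReadFrame;  // most threads ever inside the guarded read at once
    int framesProcessed;  // total frames handled by all camera threads
};

CaptureStats runCapture(int cameras, int framesPerCamera) {
    assert(cameras > 0 && framesPerCamera > 0);
    RegionProbe readProbe;
    std::mutex readFrameMutex;  // shared by all camera threads
    std::atomic<int> processed{0};

    std::vector<std::thread> threads;
    for (int c = 0; c < cameras; ++c) {
        threads.emplace_back([&] {
            for (int f = 0; f < framesPerCamera; ++f) {
                {
                    // Critical section: only the frame read is serialized.
                    std::lock_guard<std::mutex> lock(readFrameMutex);
                    readProbe.enter();
                    // Stand-in for dwSensorCamera_readFrame(...).
                    std::this_thread::sleep_for(std::chrono::microseconds(200));
                    readProbe.leave();
                }
                // Stand-in for post/receive/convert/return, outside the lock.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                ++processed;
            }
        });
    }
    for (auto& thread : threads) thread.join();
    return {readProbe.peak.load(), processed.load()};
}
```

Compared with a global lock around the whole pipeline, the lock here is held only for the short read, so the streaming and conversion stand-ins of different cameras can still overlap.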

    Hi,

    We do not need multiple GPUs to reproduce the issue.

    I also think that the thread safety policy should be clearly specified in the documentation, not only for image streamers.
    If we are using DriveWorks in the wrong way, it would be nice to provide some kind of advanced sample code (or at least an architecture). This sample should demonstrate how to consume multiple camera feeds with different algorithms with as much parallelism as possible.

    Thanks
    Xavier

    We absolutely agree.

    As I said, this issue persists even when only using one camera from each CSI port. As for keeping readFrame in the critical section, I’m not sure precisely what you mean by this.

    I must also admit to being frustrated by your lack of answers on some very simple and direct questions regarding the intended behaviour of the library and whether our assumptions about it are correct. Are we making incorrect assumptions? How should the library behave in a parallel environment? Is there anywhere in the documentation discussing thread safety that we might have missed?

    Dear johnmark,
    We will be including the thread safety policy of the API in upcoming releases of DriveWorks.
    The image streamer is expected to work in a multi-threaded environment. It would be great if you could file a bug with sample code that demonstrates the case, so that we can find the actual root cause.

    Dear SivaRamaKrishna,

    As far as I know, there is a sample showing the usage of image streamer across two threads: one acting as a producer (CUDA) and one as a consumer (GL).
    But in this sample, image streamers are well synchronized / they are actually used to synchronize both threads.

    In our use case, we have independent image streamers used in independent threads, so the combination of their internal states is not predictable the way it is in your multi-threaded sample.
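
To make the distinction concrete, here is a minimal sketch of the well-synchronized case, where a single streamer-like object keeps one producer and one consumer in lock step. `HandoffSlot` and `runLockStep` are hypothetical names, and the method names only mirror the post / receive / return / wait-posted call order used earlier in this thread; the real image streamer may behave differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-frame handoff between exactly one producer and one
// consumer. Every transition happens under the same mutex.
class HandoffSlot {
public:
    void post(int frame) {  // producer: publish a frame
        std::lock_guard<std::mutex> lock(mutex_);
        assert(state_ == State::Empty);
        frame_ = frame;
        state_ = State::Posted;
        changed_.notify_all();
    }

    int receive() {  // consumer: block until a frame is posted
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return state_ == State::Posted; });
        state_ = State::Received;
        return frame_;
    }

    void returnReceived() {  // consumer: give the frame back
        std::lock_guard<std::mutex> lock(mutex_);
        assert(state_ == State::Received);
        state_ = State::Returned;
        changed_.notify_all();
    }

    void waitPosted() {  // producer: block until the frame is returned
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return state_ == State::Returned; });
        state_ = State::Empty;
    }

private:
    enum class State { Empty, Posted, Received, Returned };
    std::mutex mutex_;
    std::condition_variable changed_;
    State state_ = State::Empty;
    int frame_ = 0;
};

// Runs one producer and one consumer in lock step and returns the frames
// in the order the consumer saw them.
std::vector<int> runLockStep(int frames) {
    HandoffSlot slot;
    std::vector<int> seen;

    std::thread consumer([&] {
        for (int i = 0; i < frames; ++i) {
            seen.push_back(slot.receive());
            slot.returnReceived();
        }
    });
    std::thread producer([&] {
        for (int i = 0; i < frames; ++i) {
            slot.post(i);
            slot.waitPosted();
        }
    });

    producer.join();
    consumer.join();
    return seen;
}
```

In this pattern the slot itself orders the two threads, so their interleaving is fully determined. With several independent slots driven by unrelated threads there is no such ordering between slots, which is the situation described in this post.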

    Thanks and Best regards,
    Xavier

    Hi,
    We’re also experiencing these issues when using two cameras on different CSI ports. We get the stack trace below. If we put a mutex lock around our whole publish handling, we get rid of the segfault, but we get a delay that is far too high to be acceptable. When might we expect the thread safety policy? Is it possible to get a draft already?

    Thanks in advance,
    //Mårten

    #0 0x0000007fb50b04d4 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #1 0x0000007fb50b07a0 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #2 0x0000007fb4fae284 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #3 0x0000007fb501cee8 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #4 0x0000007fb4fb36dc in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #5 0x0000007fb4fb4504 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #6 0x0000007fb4fb848c in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #7 0x0000007fb4f97d44 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #8 0x0000007fb4f984e8 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #9 0x0000007fb5015ff4 in cuEGLStreamConsumerAcquireFrame () from /usr/lib/libcuda.so.1
    #10 0x0000007fb60871d8 in dwFrameCapture_appendFrameGL () from /usr/local/driveworks/targets/aarch64-linux/lib/libdriveworks.so.0
    #11 0x0000007fb5f3d5a0 in dwSensorCamera_getTimestamp () from /usr/local/driveworks/targets/aarch64-linux/lib/libdriveworks.so.0
    #12 0x0000007fb6073bf0 in dwImageStreamer_initialize () from /usr/local/driveworks/targets/aarch64-linux/lib/libdriveworks.so.0
    #13 0x0000007fb605f4bc in dwImageStreamer_receiveCUDA () from /usr/local/driveworks/targets/aarch64-linux/lib/libdriveworks.so.0
    #14 0x00000000005c4f5c in DwImageStreamer::receive (this=0xa317f68, image=…)

    @SivaRamaKrishna,
    I have also run into this kind of race condition on CUDA resources during DriveWorks processing. Please refer to bug #200443881 (https://partners.nvidia.com/bug/viewbug/200443881), with the following logs:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x0000007f802b94d4 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    [Current thread is 1 (Thread 0x7f44b580d0 (LWP 12562))]
    (cuda-gdb) bt
    #0 0x0000007f802b94d4 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #1 0x0000007f802b97a0 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #2 0x0000007f801b7284 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #3 0x0000007f80225ee8 in cuVDPAUCtxCreate () from /usr/lib/libcuda.so.1
    #4 0x0000007f801bc6dc in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #5 0x0000007f801bd504 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #6 0x0000007f801c148c in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #7 0x0000007f801a0d44 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #8 0x0000007f801a14e8 in cuEGLApiInit () from /usr/lib/libcuda.so.1
    #9 0x0000007f8021ef0c in cuEGLStreamConsumerAcquireFrame () from /usr/lib/libcuda.so.1
    #10 0x0000007f83318f08 in dwFrameCapture_appendFrameGL () from /usr/local/driveworks-0.6/targets/aarch64-linux/lib/libdriveworks.so.0
    #11 0x0000007f831d02e0 in dwSensorCamera_getTimestamp () from /usr/local/driveworks-0.6/targets/aarch64-linux/lib/libdriveworks.so.0
    #12 0x0000007f83305918 in dwImageStreamer_initialize () from /usr/local/driveworks-0.6/targets/aarch64-linux/lib/libdriveworks.so.0
    #13 0x0000007f832f11e4 in dwImageStreamer_receiveCUDA () from /usr/local/driveworks-0.6/targets/aarch64-linux/lib/libdriveworks.so.0


    Sep 1 00:59:53 nvidia kernel: [84708.123185] eqos ioctl: HW PTP not running
    Sep 1 00:59:55 nvidia kernel: [84709.673466] nvmap_alloc_handle: PID 12539: sample_camera_m: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant.
    Sep 1 00:59:55 nvidia kernel: [84709.993708] nvgpu: 0000:04:00.0 gk20a_gr_isr:6023 [ERR] sked exception 02200000
    Sep 1 00:59:55 nvidia kernel: [84710.060608] nvgpu: 0000:04:00.0 gk20a_gr_isr:6023 [ERR] sked exception 02200000
    Sep 1 00:59:55 nvidia kernel: [84710.124536] sample_camera_m[12562]: unhandled level 2 translation fault (11) at 0x00000018, esr 0x92000006
    Sep 1 00:59:55 nvidia kernel: [84710.124542] pgd = ffffffc1c3ee1000
    Sep 1 00:59:55 nvidia kernel: [84710.125051] [00000018] *pgd=00000001748a1003
    Sep 1 00:59:55 nvidia kernel: [84710.128676] , *pud=00000001748a1003
    Sep 1 00:59:55 nvidia kernel: [84710.128681] , *pmd=0000000000000000
    Sep 1 00:59:55 nvidia kernel: [84710.128682]
    Sep 1 00:59:55 nvidia kernel: [84710.128684]
    Sep 1 00:59:55 nvidia kernel: [84710.128689] CPU: 0 PID: 12562 Comm: sample_camera_m Not tainted 4.9.38-rt25-tegra #1
    Sep 1 00:59:55 nvidia kernel: [84710.128691] Hardware name: drive-px2-a (DT)
    Sep 1 00:59:55 nvidia kernel: [84710.128693] task: ffffffc0fb3ec880 task.stack: ffffffc16a8a4000
    Sep 1 00:59:55 nvidia kernel: [84710.128696] PC is at 0x7f802b94d4
    Sep 1 00:59:55 nvidia kernel: [84710.128697] LR is at 0x7f802b95ac
    Sep 1 00:59:55 nvidia kernel: [84710.128701] pc : [<0000007f802b94d4>] lr : [<0000007f802b95ac>] pstate: 20000000
    Sep 1 00:59:55 nvidia kernel: [84710.129925] sp : 0000007f44b54710
    Sep 1 00:59:55 nvidia kernel: [84710.129927] x29: 0000007f44b54a70 x28: 0000007f44b548b0
    Sep 1 00:59:55 nvidia kernel: [84710.129931] x27: 0000007f44b548f8 x26: 00000000008ed7f0
    Sep 1 00:59:55 nvidia kernel: [84710.129934] x25: 0000000000d40860 x24: 0000000008000000
    Sep 1 00:59:55 nvidia kernel: [84710.129938] x23: 0000000000000000 x22: 0000007f3400c120
    Sep 1 00:59:55 nvidia kernel: [84710.129941] x21: 0000000000000000 x20: 0000000000000000
    Sep 1 00:59:55 nvidia kernel: [84710.129944] x19: 0000000000000000 x18: 0000000000000000
    Sep 1 00:59:55 nvidia kernel: [84710.129946] x17: 0000007f80a5c340 x16: 0000007f8090d610
    Sep 1 00:59:55 nvidia kernel: [84710.129949] x15: 0000000fc0200000 x14: 0000000000000001
    Sep 1 00:59:55 nvidia kernel: [84710.129952] x13: 0000000000000288 x12: 0000000000000633
    Sep 1 00:59:55 nvidia kernel: [84710.129979] x11: 0000000000000003 x10: 000000000089dc80
    Sep 1 00:59:55 nvidia kernel: [84710.129985] x9 : 00000000000000bc x8 : 0000000000000008
    Sep 1 00:59:55 nvidia kernel: [84710.129988] x7 : 00000000000000bc x6 : 000000000000002e
    Sep 1 00:59:55 nvidia kernel: [84710.129991] x5 : 0000000000000038 x4 : 0000000000000100
    Sep 1 00:59:55 nvidia kernel: [84710.129993] x3 : 0000000000000000 x2 : 0000000001000000
    Sep 1 00:59:55 nvidia kernel: [84710.129996] x1 : 0000000000000000 x0 : 0000000000000000
    Sep 1 00:59:55 nvidia kernel: [84710.129999]
    Sep 1 00:59:55 nvidia kernel: [84710.130005] Library at 0x7f802b94d4: 0x7f80018000 /usr/lib/libcuda.so.1
    Sep 1 00:59:55 nvidia kernel: [84710.135293] Library at 0x7f802b95ac: 0x7f80018000 /usr/lib/libcuda.so.1
    Sep 1 00:59:55 nvidia kernel: [84710.141913] vdso base = 0x7f850d3000