Segmentation fault in enqueue() when using multithreading on Xavier (DDPX)

carlos.galvez · June 10, 2020, 11:18am

Hi,

I get a segmentation fault when two threads each run one DNN via TensorRT on Xavier (DDPX). The segfault happens when both threads reach nvinfer1::IExecutionContext::enqueue() at exactly the same time.

Here’s the backtrace of the segfaulted thread:
#0 0x0000007f92cfe14c in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#1 0x0000007f92cfe618 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#2 0x0000007f92cfe744 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#3 0x0000007f9356c198 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#4 0x0000007f9356d924 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#5 0x0000007f931043f0 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#6 0x0000007f92b6e658 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#7 0x0000007f92b6ea38 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#8 0x0000007f92b6f334 in cudnnConvolutionForward () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#9 0x0000007fb23c9ec4 in nvinfer1::rt::cuda::CudnnConvolutionRunner::execute(nvinfer1::rt::CommonContext const&, nvinfer1::rt::ExecutionParameters const&) const () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6
#10 0x0000007fb21bdb58 in nvinfer1::rt::ExecutionContext::enqueueInternal(CUevent_st**) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6
#11 0x0000007fb21c02d0 in nvinfer1::rt::ExecutionContext::enqueue(int, void**, CUstream_st*, CUevent_st**) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6

The backtrace of the other thread varies in content. For example:
#0 0x0000007f892cfab4 in ?? () from /usr/lib/libcuda.so.1
#1 0x0000007f892f7188 in ?? () from /usr/lib/libcuda.so.1
#2 0x0000007f892f7440 in ?? () from /usr/lib/libcuda.so.1
#3 0x0000007f8944ef00 in ?? () from /usr/lib/libcuda.so.1
#4 0x0000007f8926e0b8 in ?? () from /usr/lib/libcuda.so.1
#5 0x0000007f8926e280 in ?? () from /usr/lib/libcuda.so.1
#6 0x0000007f8937fefc in cuLaunchKernel () from /usr/lib/libcuda.so.1
#7 0x0000007f8dc782ac in ?? () from /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcudart.so.10.2
#8 0x0000000000000100 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The segfault is not tied to one thread: sometimes it occurs in one thread and sometimes in the other.

For the record, I have followed these guidelines about thread-safety:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety

I have:

Ensured that each thread creates its own execution context (via engine_->createExecutionContext()).
Ensured the logger is a global variable and is thread-safe.
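
For illustration, the setup described above roughly corresponds to the following sketch. It is not the actual application code: `Engine` and `ExecutionContext` are hypothetical stand-ins for `nvinfer1::ICudaEngine` and `nvinfer1::IExecutionContext` so that the example compiles without TensorRT, and `Logger`, `runInference` and `runTwoThreads` are names made up for this example. It only shows the structure: one engine and one execution context per thread, plus a single global logger guarded by a mutex.

```cpp
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for nvinfer1::IExecutionContext (NOT the real API).
struct ExecutionContext {
    int calls = 0;  // per-context state, only touched by the owning thread
    bool enqueue() { ++calls; return true; }
};

// Hypothetical stand-in for nvinfer1::ICudaEngine (NOT the real API).
struct Engine {
    std::unique_ptr<ExecutionContext> createExecutionContext() const {
        return std::make_unique<ExecutionContext>();
    }
};

// Global logger shared by all threads; a mutex makes it thread-safe.
class Logger {
public:
    void log(const char* /*msg*/) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++count_;
    }
    int count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }
private:
    mutable std::mutex mutex_;
    int count_ = 0;
};

Logger g_logger;

// Each thread creates its own execution context and never shares it.
int runInference(const Engine& engine, int iterations) {
    auto context = engine.createExecutionContext();
    for (int i = 0; i < iterations; ++i) {
        context->enqueue();
        g_logger.log("enqueued");
    }
    return context->calls;
}

// Two threads, one DNN (engine) each, running concurrently.
int runTwoThreads(int iterations) {
    Engine engineA;
    Engine engineB;
    int callsA = 0;
    int callsB = 0;
    std::thread t1([&] { callsA = runInference(engineA, iterations); });
    std::thread t2([&] { callsB = runInference(engineB, iterations); });
    t1.join();
    t2.join();
    return callsA + callsB;
}
```

With the stand-in types, the only state shared between the two threads is the logger; everything else is owned by exactly one thread.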

What is the problem? How can I debug this further?

I currently cannot share a reproducible example since 1) it’s not perfectly reproducible and 2) it’s proprietary.

Thanks!

AastaLLL · June 11, 2020, 2:10am

Hi,

It looks like you are using our DRIVE platform, but this is the Jetson forum.
Please file your issue here to get proper support:

Thanks.

carlos.galvez · June 11, 2020, 6:46am

Thanks, I don’t know how it ended up in the wrong forum. I meant to post in the “TensorRT Forum”, which I think is more suitable.

I moved it now, feel free to delete this post!