Jetson Nano TensorRT heisenbug

namezis · January 18, 2021, 10:53pm

We are using several dozen Jetson Nano boards on a project. A Triton server is installed on each Jetson to run models in TensorRT format. All the Jetsons are identical, with the same power supplies, SD cards, housings, fans, and software.

But on some Jetsons, after a random time (from seconds to hours), the Triton server stops working with the following error:

E1229 07:59:55.644798 59 logging.cc:43] …/rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

dmesg:

[ 3578.073719] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1721 [ERR] fake mmu fault on engine 0, engine subid 0 (gpc), client 17 (prop 2), addr 0x4d112000, type 9 (work creation), access_type 0x00000000,inst_ptr 0x51d514000
[ 3578.094400] nvgpu: 57000000.gpu gk20a_fecs_dump_falcon_stats:129 [ERR] gr_fecs_os_r : 0

[ 3578.397083] nvgpu: 57000000.gpu gk20a_fecs_dump_falcon_stats:169 [ERR] FECS_FALCON_REG_IMB : 0xbadfbadf

[ 3578.536579] nvgpu: 57000000.gpu fifo_error_isr:2605 [ERR] channel reset initiated from fifo_error_isr; intr=0x00000100

L4T version:
R32 (release), REVISION: 4.3
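
With several dozen devices, it can help to pull only the nvgpu error lines out of each dmesg dump before comparing them. The following is a minimal Python sketch, assuming the line layout shown in the dmesg excerpt above; the function name parse_nvgpu_errors and the regular expression are illustrative and not part of any NVIDIA tool.

```python
import re

# Matches lines such as:
# [ 3578.094400] nvgpu: 57000000.gpu gk20a_fecs_dump_falcon_stats:129 [ERR] gr_fecs_os_r : 0
NVGPU_ERR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+\S+\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s*(?P<msg>.*)$"
)


def parse_nvgpu_errors(dmesg_text):
    """Return (timestamp, function, message) for each nvgpu [ERR] line."""
    errors = []
    for raw_line in dmesg_text.splitlines():
        match = NVGPU_ERR.match(raw_line.strip())
        if match:
            errors.append(
                (float(match.group("ts")), match.group("func"), match.group("msg").strip())
            )
    return errors
```

The text passed in could be the saved output of dmesg from each device; lines that do not match the assumed layout are simply skipped.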

AastaLLL · January 19, 2021, 3:34am

Hi,

So that we can give further suggestions, could you share detailed steps to reproduce this error in our environment?
Please also share the failure rate and roughly how long it takes to reproduce.

Thanks.

namezis · January 19, 2021, 8:12am

Steps:
Install Triton server 2.0.0 for JetPack (Release 2.0.0 corresponding to NGC container 20.06 · triton-inference-server/server · GitHub) and add the attached human-detection model.
libnvinfer_plugin.so (17.5 MB)
Command used for the conversion:
export LD_PRELOAD=/<path_to_libnvinfer_plugin>/libnvinfer_plugin.so &&
trtexec --uff=up_v32_224_400.uff --output=NMS --fp16 --uffInput=Input,3,224,400 --saveEngine=modelup.plan

up_v32_224_400.uff (12.0 MB)
modelup.plan (8.4 MB)

Start Triton with the command:
/opt/triton/bin/tritonserver --model-repository=/models --strict-model-config=false --min-supported-compute-capability 5

Time to reproduce:
About a day on some devices, as little as 10 minutes on others.

AastaLLL · January 21, 2021, 5:21am

Thanks for the data.
Will get back to you later.

AastaLLL · January 21, 2021, 7:41am

Hi,

Just to confirm:
Can the error be reproduced with the server-side app alone,
or does it need requests from a client?

Thanks.

namezis · January 21, 2021, 8:10am

We managed to reproduce the error when running inference on all-zero input, a white image, and random data. All of this was done on the same Jetson.

namezis · January 25, 2021, 2:24pm

Additional information: in a sample of 50 Jetsons, the error occurs on 27.

AastaLLL · January 26, 2021, 8:29am

Hi,

Sorry for the late update.
Could you also share the client-side command or source with us?

Thanks.

namezis · January 26, 2021, 4:26pm

perf_client (7.9 MB)
The error can be reproduced using perf_client from the Triton 2.0 release, by running the command “./perf_client -m detection” multiple times.

tritonserver E0126 16:22:09.700867 48 logging.cc:43] …/rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
tritonserver E0126 16:22:09.701218 48 logging.cc:43] FAILED_EXECUTION: std::exception

Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: pinned buffer: failed to perform CUDA copy: the launch timed out and was terminated
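
Since the failure only shows up after repeated runs, the repetition can be scripted. A minimal Python sketch, assuming that a failed run either exits with a non-zero status or prints one of the client messages quoted above (both are assumptions, not documented perf_client behaviour); run_until_failure is a hypothetical helper:

```python
import subprocess

# Text fragments taken from the client-side failure output quoted above.
FAILURE_MARKERS = (
    "Failed to maintain requested inference load",
    "failed to perform CUDA copy",
)


def run_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly and report the first run that looks like a failure.

    Returns (run_number, output) for the first failing run,
    or (None, "") if all max_runs runs look successful.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            universal_newlines=True,   # decode output as text
        )
        output = result.stdout
        if result.returncode != 0 or any(m in output for m in FAILURE_MARKERS):
            return run_number, output
    return None, ""
```

For example, run_until_failure(["./perf_client", "-m", "detection"]) would repeat the command from this post up to 100 times and stop at the first run that fails.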

AastaLLL · January 28, 2021, 6:44am

Hi,

Thanks for sharing.

We are trying to reproduce this issue internally (with multiple client connections).
We will let you know what we find.

AastaLLL · February 1, 2021, 9:07am

Hi,

We tried to reproduce this on multiple Nanos, connecting more than 50 times,
but could not reproduce the error you shared.

Our guess is that this error might be related to the connection pattern.
Would you mind sharing more details about how you connect to the server?
(e.g. multiple connections at the same time?)

Thanks.

AastaLLL · February 3, 2021, 7:33am

Hi,

Could you also share your Nano board type with us?

Thanks.

namezis · February 7, 2021, 7:08pm

Hi,
board A02 (P3448-0000)

namezis · February 7, 2021, 8:08pm

I managed to figure out the error. The link to a similar case helped a lot:
TensorRT inference context in ROS callback - #11 by ec2020
The error was caused by the combination of multithreading and CUDA. We ran the Triton server in a Docker container and specified CPU limits for the container.

tritonserver:
    build: ./TritonServer
    network_mode: host
    container_name: tritonserver
    privileged: true
    restart: unless-stopped
    cpu_quota: 100000
    shm_size: '1gb'
    ulimits:
      memlock: -1
      stack: 67108864
    volumes:
      - 'video-data:/videos'
      - 'models-data:/models'

For unknown reasons, everything worked perfectly on some Jetsons, while on others this error occurred after a while. The problem was fixed by completely removing the limits on the container and updating the Jetson to the latest JetPack version, 4.5. Thanks for the help!
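
For reference, cpu_quota in the compose file above is a CFS quota in microseconds per scheduling period. Assuming cpu_period was left at the Docker default of 100000 microseconds (the file does not set it), a quota of 100000 caps the whole container, all threads together, at one CPU's worth of time on a four-core board. The arithmetic, as a small Python sketch (effective_cpus is an illustrative helper, not a Docker API):

```python
DEFAULT_CPU_PERIOD_US = 100_000  # Docker's default CFS period, in microseconds


def effective_cpus(cpu_quota_us, cpu_period_us=DEFAULT_CPU_PERIOD_US):
    """Return how many CPUs' worth of time a CFS quota allows.

    A quota of zero or less means that no limit is applied, so None is returned.
    """
    if cpu_quota_us <= 0:
        return None
    return cpu_quota_us / cpu_period_us


# The limit from the compose file above: 100000 / 100000 = 1.0 CPU.
limit = effective_cpus(100_000)
```

This only quantifies the limit that was removed; why the limit triggered the error on some devices and not on others remains unexplained.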

darshancganji12 · March 16, 2021, 6:10pm

Hi @namezis,

I am facing the exact same error while serving on a Jetson. I assume it may be an issue related to the CUDA context (I am not exactly sure). Could you please elaborate on how you solved it? That would be a great help.

Thanks,
Darshan

jan.nitschke1 · April 29, 2021, 8:49am

Hey @darshancganji12 ,

Maybe this is helpful: I was facing a similar issue in a scenario where I needed both TRT and TensorFlow in the same project. Making sure that import pycuda.autoinit is executed before any TensorFlow resources are allocated helped in my case. For more, see: Tensorrt in Python: Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR) · Issue #303 · NVIDIA/TensorRT · GitHub
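
A minimal sketch of that import order, assuming both pycuda and tensorflow are installed; the wrapper function init_cuda_then_tensorflow is hypothetical and only makes the ordering explicit:

```python
def init_cuda_then_tensorflow():
    """Create the CUDA context via pycuda first, then import TensorFlow."""
    # Importing pycuda.autoinit initializes CUDA and creates a context
    # on a device as a side effect of the import.
    import pycuda.autoinit  # noqa: F401

    # TensorFlow is imported only afterwards, so that it does not
    # allocate GPU resources before the pycuda context exists.
    import tensorflow as tf

    return tf
```

Calling this once at program start, before anything else touches TensorFlow, keeps the order fixed regardless of how the other modules in the project are imported.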