Nondeterministic predictions from nvinferserver via gRPC

v.burachonak · June 14, 2022, 9:22am

• Hardware Platform: Jetson Xavier / Jetson Nano / GTX 1060 / RTX 3060Ti
• DeepStream Version: nvcr.io/nvidia/deepstream:6.0.1-triton, nvcr.io/nvidia/deepstream:6.1-triton
• JetPack Version: Jetpack4.6
• NVIDIA GPU Driver Version: 510.73.05

I wrote code based on deepstream-test1 that reproduces the bug.
In deepstream_test1_app.c, I changed the osd_sink_pad_buffer_probe function as follows:

static GstPadProbeReturn
osd_sink_pad_buffer_probe (GstPad * pad, GstPadProbeInfo * info,
   gpointer u_data)
{
   GstBuffer *buf = (GstBuffer *) info->data;
   NvDsObjectMeta *obj_meta = NULL;
   NvDsMetaList * l_frame = NULL;
   NvDsMetaList * l_obj = NULL;
   NvDsDisplayMeta *display_meta = NULL;

   NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);

   /* Print the bounding box of every vehicle and person in each frame of the batch. */
   for (l_frame = batch_meta->frame_meta_list; l_frame != NULL; l_frame = l_frame->next) {
       NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) (l_frame->data);
       for (l_obj = frame_meta->obj_meta_list; l_obj != NULL; l_obj = l_obj->next) {
           obj_meta = (NvDsObjectMeta *) (l_obj->data);
           if (obj_meta->class_id == PGIE_CLASS_ID_VEHICLE) {
             printf("frame = %d, class vehicle, left = %f, top = %f , width = %f, height = %f\n",
                    frame_number, obj_meta->rect_params.left, obj_meta->rect_params.top,
                    obj_meta->rect_params.width, obj_meta->rect_params.height);
           }
           if (obj_meta->class_id == PGIE_CLASS_ID_PERSON) {
             printf("frame = %d, class person, left = %f, top = %f , width = %f, height = %f\n",
                    frame_number, obj_meta->rect_params.left, obj_meta->rect_params.top,
                    obj_meta->rect_params.width, obj_meta->rect_params.height);
           }
       }
   }
   /* frame_number is the global frame counter defined in the original sample. */
   frame_number++;
   return GST_PAD_PROBE_OK;
}

I also changed the definitions of the pgie and sink elements:

pgie = gst_element_factory_make ("nvinferserver", "primary-nvinference-engine");
sink = gst_element_factory_make ("fakesink", "nvvideo-renderer");

The dstest1_pgie_config.txt file was rewritten completely:

infer_config {
 unique_id: 1
 gpu_ids: 0
 max_batch_size: 30
 backend {
   inputs: [ {
     name: "input_1"
   }]
   outputs: [
     {name: "conv2d_bbox"},
     {name: "conv2d_cov/Sigmoid"}
   ]
   triton {
     model_name: "Primary_Detector"
     version: -1
     model_repo {
       root: "/models/"
       log_level: 0
       strict_model_config: true
     }
   }
 }

 preprocess {
   network_format: MEDIA_FORMAT_NONE
   tensor_order: TENSOR_ORDER_LINEAR
   tensor_name: "input_1"
   maintain_aspect_ratio: 0
   frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
   frame_scaling_filter: 1
   normalize {
     scale_factor: 0.0039215697906911373
     channel_offsets: [0, 0, 0]
   }
 }

 postprocess {
   detection {
     num_detected_classes: 4
     nms {
       confidence_threshold: 0.5
       iou_threshold: 0.3
       topk : 4
     }
   }
 }

 extra {
   copy_input_to_host_buffers: false
 }
}

input_control {
 process_mode: PROCESS_MODE_FULL_FRAME
 interval: 0
}

I also added a docker-compose.yml file for tritonserver:

version: '3.9'
services:
 triton_server:
   restart: always
   image: nvcr.io/nvidia/tritonserver:21.08-py3
   container_name: triton_server
   command: tritonserver --model-repository=/models/
   volumes:
     - ./models:/models
   ports:
     - "8001:8001"
     - "8002:8002"
     - "8003:8003"
   deploy:
     resources:
       reservations:
         devices:
           - driver: nvidia
             device_ids: [ '0' ]
             capabilities: [ gpu ]

After running docker-compose up -d, the container logs showed the following model status:

+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| Primary_Detector | 1       | READY  |
+------------------+---------+--------+

Now run the code twice:

nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 1.log
nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 2.log

After running the code, the two log files are identical except for the startup timestamp:

root@49511ef0114c:/deepstream-test1# diff 2.log 3.log
2c2
< 2022-06-13 14:43:47.455075: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
---
> 2022-06-13 14:44:05.381798: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

Now change the following fragment of dstest1_pgie_config.txt from:

model_repo {
  root: "/models/"
  log_level: 0
  strict_model_config: true
}

to:

grpc {
   url: "triton_server_address:8001"
}
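
For orientation, with this change applied the triton block of the config would read roughly as follows. This is a sketch assembled from the two fragments above, not a copy of the tested file; triton_server_address is the same placeholder used in this post.

```
triton {
  model_name: "Primary_Detector"
  version: -1
  grpc {
    url: "triton_server_address:8001"
  }
}
```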

Run the same code again:

nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 1_bug.log
nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 2_bug.log

This time the two logs differ:

root@49511ef0114c:/deepstream-test1# diff 1_bug.log 2_bug.log
17,36c17,36
< frame = 3, class vehicle, left = 544.536987, top = 475.619720 , width = 53.425827, height = 48.065083
< frame = 3, class vehicle, left = 588.732971, top = 477.273010 , width = 61.471939, height = 50.111492
< frame = 3, class vehicle, left = 615.755188, top = 495.509369 , width = 104.406647, height = 78.806992
< frame = 3, class person, left = 299.444122, top = 455.046082 , width = 176.051331, height = 362.353668
< frame = 4, class vehicle, left = 545.431519, top = 474.418884 , width = 52.260040, height = 49.574074
< frame = 4, class vehicle, left = 594.631226, top = 477.946472 , width = 68.152817, height = 49.521412
< frame = 4, class vehicle, left = 614.145813, top = 497.648163 , width = 105.005493, height = 76.422791
< frame = 4, class person, left = 298.719238, top = 456.335510 , width = 173.643143, height = 355.677856
< frame = 5, class person, left = 310.979645, top = 454.690552 , width = 156.119980, height = 402.667847
< frame = 5, class vehicle, left = 619.035522, top = 495.053131 , width = 106.797638, height = 81.174263
< frame = 5, class vehicle, left = 592.023560, top = 475.581085 , width = 73.765594, height = 50.918362
< frame = 5, class person, left = 0.000000, top = 396.647644 , width = 200.781845, height = 568.117798
< frame = 6, class vehicle, left = 619.577271, top = 493.844086 , width = 103.672165, height = 80.190193
< frame = 6, class vehicle, left = 544.962708, top = 475.137756 , width = 56.897598, height = 49.003025
< frame = 6, class vehicle, left = 578.650452, top = 476.278503 , width = 71.975006, height = 52.033951
< frame = 6, class person, left = 300.859833, top = 456.766663 , width = 171.413467, height = 355.382233
< frame = 7, class person, left = 313.783081, top = 461.919189 , width = 152.786789, height = 366.220795
< frame = 7, class vehicle, left = 545.041809, top = 476.345398 , width = 54.747253, height = 48.391716
< frame = 7, class vehicle, left = 589.668335, top = 475.500732 , width = 68.211044, height = 52.685608
< frame = 7, class person, left = 0.000000, top = 398.737000 , width = 213.900055, height = 602.954346
---
> frame = 3, class vehicle, left = 612.738770, top = 495.958832 , width = 107.338440, height = 78.978500
> frame = 3, class person, left = 295.735687, top = 454.111176 , width = 178.626022, height = 367.075012
> frame = 3, class vehicle, left = 587.028992, top = 476.702850 , width = 67.797684, height = 50.701576
> frame = 3, class vehicle, left = 543.125977, top = 476.135437 , width = 54.326981, height = 48.317467
> frame = 4, class vehicle, left = 619.577271, top = 493.844086 , width = 103.672165, height = 80.190193
> frame = 4, class vehicle, left = 544.962708, top = 475.137756 , width = 56.897598, height = 49.003025
> frame = 4, class vehicle, left = 578.650452, top = 476.278503 , width = 71.975006, height = 52.033951
> frame = 4, class person, left = 300.859833, top = 456.766663 , width = 171.413467, height = 355.382233
> frame = 5, class vehicle, left = 544.536987, top = 475.619720 , width = 53.425827, height = 48.065083
> frame = 5, class vehicle, left = 588.732971, top = 477.273010 , width = 61.471939, height = 50.111492
> frame = 5, class vehicle, left = 615.755188, top = 495.509369 , width = 104.406647, height = 78.806992
> frame = 5, class person, left = 299.444122, top = 455.046082 , width = 176.051331, height = 362.353668
> frame = 6, class vehicle, left = 545.137085, top = 474.665344 , width = 56.423080, height = 50.258690
> frame = 6, class vehicle, left = 621.677185, top = 494.147034 , width = 110.575424, height = 84.068649
> frame = 6, class vehicle, left = 588.362183, top = 475.390350 , width = 70.541199, height = 52.784439
> frame = 6, class person, left = 0.000000, top = 396.102020 , width = 196.791916, height = 557.763489
> frame = 7, class person, left = 310.979645, top = 454.690552 , width = 156.119980, height = 402.667847
> frame = 7, class vehicle, left = 619.035522, top = 495.053131 , width = 106.797638, height = 81.174263
> frame = 7, class vehicle, left = 592.023560, top = 475.581085 , width = 73.765594, height = 50.918362
> frame = 7, class person, left = 0.000000, top = 396.647644 , width = 200.781845, height = 568.117798
41,55c41,56
etc ...

Analyzing the difference between the two files shows that stale data can arrive (for example, from previous frames), that the order of the data is not deterministic, and that data can be skipped or delivered twice.

source.tar.xz (12.8 MB)
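
The symptoms described above can be told apart mechanically instead of by reading the diff. The following is a minimal sketch in C, assuming the log lines have exactly the format printed by the probe in this post. The names Detection, parse_detection and same_detections are hypothetical helpers for illustration and are not part of DeepStream. The comparison ignores both the order of the detections and the frame number, so it can check whether two frames carry the same boxes even when they appear at different positions in the two logs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One detection as printed by the probe in this post (hypothetical helper type). */
typedef struct {
    int frame;
    char cls[16];
    double left, top, width, height;
} Detection;

/* Parse one log line. Returns 1 on success, 0 if the line is not a detection. */
int parse_detection(const char *line, Detection *d)
{
    assert(line != NULL && d != NULL);
    int n = sscanf(line,
        "frame = %d, class %15[a-z], left = %lf, top = %lf , width = %lf, height = %lf",
        &d->frame, d->cls, &d->left, &d->top, &d->width, &d->height);
    return n == 6;
}

static int cmp_double(double a, double b)
{
    return (a > b) - (a < b);
}

/* Order detections by class and box only; the frame number is deliberately ignored.
 * Exact comparison is acceptable here because the values come from fixed-precision text. */
int cmp_detection(const void *pa, const void *pb)
{
    const Detection *a = pa;
    const Detection *b = pb;
    int c;
    if ((c = strcmp(a->cls, b->cls)) != 0) return c;
    if ((c = cmp_double(a->left, b->left)) != 0) return c;
    if ((c = cmp_double(a->top, b->top)) != 0) return c;
    if ((c = cmp_double(a->width, b->width)) != 0) return c;
    return cmp_double(a->height, b->height);
}

/* Returns 1 if both arrays hold the same detections regardless of order.
 * Both arrays are sorted in place. */
int same_detections(Detection *a, size_t na, Detection *b, size_t nb)
{
    if (na != nb) return 0;
    qsort(a, na, sizeof *a, cmp_detection);
    qsort(b, nb, sizeof *b, cmp_detection);
    for (size_t i = 0; i < na; i++)
        if (cmp_detection(&a[i], &b[i]) != 0) return 0;
    return 1;
}
```

To use it, read both logs line by line, collect the detections of one frame from each log into an array, and pass the two arrays to same_detections. On the diff shown above, frame 3 of 1_bug.log and frame 3 of 2_bug.log do not match, while frame 3 of 1_bug.log and frame 5 of 2_bug.log contain identical lines, which suggests results being attached to the wrong frame rather than only being reordered.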

v.burachonak · June 17, 2022, 11:57am

Is there any progress?

mchi · June 17, 2022, 3:49pm

Sorry! We are still checking this… will get back to you next week.

v.burachonak · June 27, 2022, 9:14am

Hello. Has there been any progress?

mchi · June 29, 2022, 4:40pm

Hi @v.burachonak ,
Sorry for the long delay! I can reproduce this, but I am still debugging it.

mchi · July 14, 2022, 12:46pm

Hi @v.burachonak
Sorry for the long delay! We are still debugging this… is this issue still important for you?

v.burachonak · July 18, 2022, 6:19am

Hi. This issue is important for several of my projects. I can’t work around this bug in any way :(

windy60j34 · July 21, 2022, 12:00am

There was a bug related to partial data corruption in DS-gRPC, and it will be fixed in the next release soon. Until then, if you are running tritonserver and the client on the same machine, please try the DS-Triton-CAPI approach, which has been fully tested.

v.burachonak · July 21, 2022, 5:25am

Thank you

