Nondeterministic predictions from nvinferserver via gRPC

• Hardware Platform: Jetson Xavier / Jetson Nano / GTX 1060 / RTX 3060Ti
• DeepStream Version: nvcr.io/nvidia/deepstream:6.0.1-triton, nvcr.io/nvidia/deepstream:6.1-triton
• JetPack Version: JetPack 4.6
• NVIDIA GPU Driver Version: 510.73.05

Based on deepstream-test1, I wrote code that can reproduce the bug.
In the deepstream_test1_app.c file, the osd_sink_pad_buffer_probe function has been changed as follows:

static GstPadProbeReturn
osd_sink_pad_buffer_probe (GstPad * pad, GstPadProbeInfo * info,
    gpointer u_data)
{
  GstBuffer *buf = (GstBuffer *) info->data;
  NvDsObjectMeta *obj_meta = NULL;
  NvDsMetaList *l_frame = NULL;
  NvDsMetaList *l_obj = NULL;

  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);

  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL; l_frame = l_frame->next) {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) (l_frame->data);
    for (l_obj = frame_meta->obj_meta_list; l_obj != NULL; l_obj = l_obj->next) {
      obj_meta = (NvDsObjectMeta *) (l_obj->data);
      /* Print the bounding box of every vehicle and person detection. */
      if (obj_meta->class_id == PGIE_CLASS_ID_VEHICLE) {
        printf("frame = %d, class vehicle, left = %f, top = %f , width = %f, height = %f\n",
            frame_number, obj_meta->rect_params.left, obj_meta->rect_params.top,
            obj_meta->rect_params.width, obj_meta->rect_params.height);
      }
      if (obj_meta->class_id == PGIE_CLASS_ID_PERSON) {
        printf("frame = %d, class person, left = %f, top = %f , width = %f, height = %f\n",
            frame_number, obj_meta->rect_params.left, obj_meta->rect_params.top,
            obj_meta->rect_params.width, obj_meta->rect_params.height);
      }
    }
  }
  /* frame_number is the global counter defined in deepstream_test1_app.c;
   * it is incremented once per buffer that reaches this probe. */
  frame_number++;
  return GST_PAD_PROBE_OK;
}

I also changed the definitions of the pgie and sink elements:

pgie = gst_element_factory_make ("nvinferserver", "primary-nvinference-engine");
sink = gst_element_factory_make ("fakesink", "nvvideo-renderer");
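A side note on reading these logs: the probe prints objects in whatever order they appear in frame_meta->obj_meta_list, so a plain diff mixes ordering noise with real differences. A minimal sketch, assuming each object's class_id and rect_params are copied into a small array inside the l_obj loop, sorts one frame's detections by class, left and top before printing. The Detection struct and print_frame_sorted() are hypothetical helpers, not DeepStream API, and the class is printed as a number rather than a name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fields the probe prints from NvDsObjectMeta. */
typedef struct {
  int class_id;
  float left, top, width, height;
} Detection;

/* Three-way compare helpers: return -1, 0 or 1. */
static int cmp_int (int a, int b) { return (a > b) - (a < b); }
static int cmp_float (float a, float b) { return (a > b) - (a < b); }

/* Order by class, then left, then top, so the printed order no longer
 * depends on the order of obj_meta_list. */
static int
detection_cmp (const void *a, const void *b)
{
  const Detection *x = (const Detection *) a;
  const Detection *y = (const Detection *) b;
  int c = cmp_int (x->class_id, y->class_id);
  if (c == 0)
    c = cmp_float (x->left, y->left);
  if (c == 0)
    c = cmp_float (x->top, y->top);
  return c;
}

/* Sort one frame's detections in place, then print them one per line. */
static void
print_frame_sorted (int frame, Detection * dets, size_t n)
{
  if (n == 0)
    return;
  assert (dets != NULL);
  qsort (dets, n, sizeof *dets, detection_cmp);
  for (size_t i = 0; i < n; i++)
    printf ("frame = %d, class %d, left = %f, top = %f , width = %f, height = %f\n",
        frame, dets[i].class_id, dets[i].left, dets[i].top,
        dets[i].width, dets[i].height);
}
```

Sorting only removes the ordering noise; detections that land on the wrong frame, or frames that are skipped or repeated, still show up in the diff.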

The dstest1_pgie_config.txt file was completely rewritten:

infer_config {
 unique_id: 1
 gpu_ids: 0
 max_batch_size: 30
 backend {
   inputs: [ {
     name: "input_1"
   }]
   outputs: [
     {name: "conv2d_bbox"},
     {name: "conv2d_cov/Sigmoid"}
   ]
   triton {
     model_name: "Primary_Detector"
     version: -1
     model_repo {
       root: "/models/"
        log_level: 0
        strict_model_config: true
     }
   }
 }

 preprocess {
   network_format: MEDIA_FORMAT_NONE
   tensor_order: TENSOR_ORDER_LINEAR
   tensor_name: "input_1"
   maintain_aspect_ratio: 0
   frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
   frame_scaling_filter: 1
   normalize {
     scale_factor: 0.0039215697906911373
     channel_offsets: [0, 0, 0]
   }
 }

 postprocess {
   detection {
     num_detected_classes: 4
     nms {
       confidence_threshold: 0.5
       iou_threshold: 0.3
       topk : 4
     }
   }
 }

 extra {
   copy_input_to_host_buffers: false
 }
}

input_control {
 process_mode: PROCESS_MODE_FULL_FRAME
 interval: 0
}
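For reference, the nms block above roughly controls three things: boxes scoring below confidence_threshold are dropped, a box is suppressed if it overlaps an already kept box by more than iou_threshold, and at most topk boxes are kept. The following is a generic greedy NMS sketch in C that illustrates those three parameters on rect_params-style boxes. It is an assumption-based illustration, not nvinferserver's actual clustering code, which may differ (for example, in whether it runs per class); Box and greedy_nms() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical box type: rect_params-style left/top/width/height plus a score. */
typedef struct {
  float left, top, width, height;
  float score;
} Box;

static float max_f (float a, float b) { return a > b ? a : b; }
static float min_f (float a, float b) { return a < b ? a : b; }

/* Intersection-over-union of two axis-aligned boxes. */
static float
box_iou (const Box * a, const Box * b)
{
  float x1 = max_f (a->left, b->left);
  float y1 = max_f (a->top, b->top);
  float x2 = min_f (a->left + a->width, b->left + b->width);
  float y2 = min_f (a->top + a->height, b->top + b->height);
  if (x2 <= x1 || y2 <= y1)
    return 0.0f;                /* no overlap */
  float inter = (x2 - x1) * (y2 - y1);
  float uni = a->width * a->height + b->width * b->height - inter;
  return uni > 0.0f ? inter / uni : 0.0f;
}

/* qsort comparator: highest score first. */
static int
score_desc (const void *a, const void *b)
{
  float sa = ((const Box *) a)->score;
  float sb = ((const Box *) b)->score;
  return (sa < sb) - (sa > sb);
}

/* Greedy NMS: sorts boxes by score (in place), drops boxes below conf_thresh,
 * skips any box whose IoU with an already kept box exceeds iou_thresh, and
 * stops after topk boxes. Kept boxes are written to out; returns their count. */
static size_t
greedy_nms (Box * boxes, size_t n, float conf_thresh, float iou_thresh,
    size_t topk, Box * out)
{
  size_t kept = 0;
  if (n == 0)
    return 0;
  assert (boxes != NULL && out != NULL);
  qsort (boxes, n, sizeof *boxes, score_desc);
  for (size_t i = 0; i < n && kept < topk; i++) {
    if (boxes[i].score < conf_thresh)
      break;                    /* sorted, so every remaining box is lower */
    int suppressed = 0;
    for (size_t j = 0; j < kept && !suppressed; j++)
      suppressed = box_iou (&boxes[i], &out[j]) > iou_thresh;
    if (!suppressed)
      out[kept++] = boxes[i];
  }
  return kept;
}
```

With confidence_threshold 0.5, iou_threshold 0.3 and topk 4 as in the config, two heavily overlapping boxes collapse to the higher-scoring one, and a box scoring 0.4 is discarded.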

A docker-compose.yml file was also added for the Triton server:

version: '3.9'
services:
 triton_server:
   restart: always
   image: nvcr.io/nvidia/tritonserver:21.08-py3
   container_name: triton_server
   command: tritonserver --model-repository=/models/
   volumes:
     - ./models:/models
   ports:
     - "8001:8001"
     - "8002:8002"
     - "8003:8003"
   deploy:
     resources:
       reservations:
         devices:
           - driver: nvidia
             device_ids: [ '0' ]
             capabilities: [ gpu ]

After running docker-compose up -d, the container logs showed the following model status:

+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| Primary_Detector | 1       | READY  |
+------------------+---------+--------+

Now run the app twice:

nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 1.log
nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 2.log

Both log files are identical except for the startup timestamp:

root@49511ef0114c:/deepstream-test1# diff 2.log 3.log
2c2
< 2022-06-13 14:43:47.455075: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
---
> 2022-06-13 14:44:05.381798: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

Next, change this fragment of dstest1_pgie_config.txt:

model_repo {
  root: "/models/"
  log_level: 0
  strict_model_config: true
}

to this:

grpc {
   url: "triton_server_address:8001"
}

Run the app twice again:

nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 1_bug.log
nohup ./deepstream-test1-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 > 2_bug.log

Comparing the two logs gives:

root@49511ef0114c:/deepstream-test1# diff 1_bug.log 2_bug.log
17,36c17,36
< frame = 3, class vehicle, left = 544.536987, top = 475.619720 , width = 53.425827, height = 48.065083
< frame = 3, class vehicle, left = 588.732971, top = 477.273010 , width = 61.471939, height = 50.111492
< frame = 3, class vehicle, left = 615.755188, top = 495.509369 , width = 104.406647, height = 78.806992
< frame = 3, class person, left = 299.444122, top = 455.046082 , width = 176.051331, height = 362.353668
< frame = 4, class vehicle, left = 545.431519, top = 474.418884 , width = 52.260040, height = 49.574074
< frame = 4, class vehicle, left = 594.631226, top = 477.946472 , width = 68.152817, height = 49.521412
< frame = 4, class vehicle, left = 614.145813, top = 497.648163 , width = 105.005493, height = 76.422791
< frame = 4, class person, left = 298.719238, top = 456.335510 , width = 173.643143, height = 355.677856
< frame = 5, class person, left = 310.979645, top = 454.690552 , width = 156.119980, height = 402.667847
< frame = 5, class vehicle, left = 619.035522, top = 495.053131 , width = 106.797638, height = 81.174263
< frame = 5, class vehicle, left = 592.023560, top = 475.581085 , width = 73.765594, height = 50.918362
< frame = 5, class person, left = 0.000000, top = 396.647644 , width = 200.781845, height = 568.117798
< frame = 6, class vehicle, left = 619.577271, top = 493.844086 , width = 103.672165, height = 80.190193
< frame = 6, class vehicle, left = 544.962708, top = 475.137756 , width = 56.897598, height = 49.003025
< frame = 6, class vehicle, left = 578.650452, top = 476.278503 , width = 71.975006, height = 52.033951
< frame = 6, class person, left = 300.859833, top = 456.766663 , width = 171.413467, height = 355.382233
< frame = 7, class person, left = 313.783081, top = 461.919189 , width = 152.786789, height = 366.220795
< frame = 7, class vehicle, left = 545.041809, top = 476.345398 , width = 54.747253, height = 48.391716
< frame = 7, class vehicle, left = 589.668335, top = 475.500732 , width = 68.211044, height = 52.685608
< frame = 7, class person, left = 0.000000, top = 398.737000 , width = 213.900055, height = 602.954346
---
> frame = 3, class vehicle, left = 612.738770, top = 495.958832 , width = 107.338440, height = 78.978500
> frame = 3, class person, left = 295.735687, top = 454.111176 , width = 178.626022, height = 367.075012
> frame = 3, class vehicle, left = 587.028992, top = 476.702850 , width = 67.797684, height = 50.701576
> frame = 3, class vehicle, left = 543.125977, top = 476.135437 , width = 54.326981, height = 48.317467
> frame = 4, class vehicle, left = 619.577271, top = 493.844086 , width = 103.672165, height = 80.190193
> frame = 4, class vehicle, left = 544.962708, top = 475.137756 , width = 56.897598, height = 49.003025
> frame = 4, class vehicle, left = 578.650452, top = 476.278503 , width = 71.975006, height = 52.033951
> frame = 4, class person, left = 300.859833, top = 456.766663 , width = 171.413467, height = 355.382233
> frame = 5, class vehicle, left = 544.536987, top = 475.619720 , width = 53.425827, height = 48.065083
> frame = 5, class vehicle, left = 588.732971, top = 477.273010 , width = 61.471939, height = 50.111492
> frame = 5, class vehicle, left = 615.755188, top = 495.509369 , width = 104.406647, height = 78.806992
> frame = 5, class person, left = 299.444122, top = 455.046082 , width = 176.051331, height = 362.353668
> frame = 6, class vehicle, left = 545.137085, top = 474.665344 , width = 56.423080, height = 50.258690
> frame = 6, class vehicle, left = 621.677185, top = 494.147034 , width = 110.575424, height = 84.068649
> frame = 6, class vehicle, left = 588.362183, top = 475.390350 , width = 70.541199, height = 52.784439
> frame = 6, class person, left = 0.000000, top = 396.102020 , width = 196.791916, height = 557.763489
> frame = 7, class person, left = 310.979645, top = 454.690552 , width = 156.119980, height = 402.667847
> frame = 7, class vehicle, left = 619.035522, top = 495.053131 , width = 106.797638, height = 81.174263
> frame = 7, class vehicle, left = 592.023560, top = 475.581085 , width = 73.765594, height = 50.918362
> frame = 7, class person, left = 0.000000, top = 396.647644 , width = 200.781845, height = 568.117798
41,55c41,56
etc ...

Comparing the two files shows that over gRPC the results are nondeterministic: detections can be attached to the wrong frame (for example, the detections logged for frame 3 in 1_bug.log appear as frame 5 in 2_bug.log), the order of objects within a frame varies, and results for a frame may be skipped or delivered twice.
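To check for frame shifts like this without relying on line order, an order-insensitive comparison helps. A minimal sketch, assuming the log lines have already been parsed into a hypothetical Det struct (class id 0 for vehicle and 2 for person, as in deepstream-test1): frames_match() returns true when two frames contain the same set of boxes within a tolerance, regardless of the order they were printed in. The 64-detections-per-frame limit is an assumption for this stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one parsed log line: class id plus rect_params. */
typedef struct {
  int class_id;
  float left, top, width, height;
} Det;

#define MAX_DETS_PER_FRAME 64   /* assumption: enough for this stream */

static float
abs_diff (float x, float y)
{
  return x > y ? x - y : y - x;
}

/* Same class and every coordinate within tol pixels. */
static int
det_equal (const Det * a, const Det * b, float tol)
{
  return a->class_id == b->class_id
      && abs_diff (a->left, b->left) <= tol
      && abs_diff (a->top, b->top) <= tol
      && abs_diff (a->width, b->width) <= tol
      && abs_diff (a->height, b->height) <= tol;
}

/* Order-insensitive frame comparison: true if both frames have the same
 * number of detections and each detection in a has its own match in b. */
static int
frames_match (const Det * a, size_t na, const Det * b, size_t nb, float tol)
{
  int used[MAX_DETS_PER_FRAME] = { 0 };
  if (na != nb)
    return 0;
  assert (nb <= MAX_DETS_PER_FRAME);
  for (size_t i = 0; i < na; i++) {
    int found = 0;
    for (size_t j = 0; j < nb && !found; j++) {
      if (!used[j] && det_equal (&a[i], &b[j], tol)) {
        used[j] = 1;            /* each detection in b may match only once */
        found = 1;
      }
    }
    if (!found)
      return 0;
  }
  return 1;
}
```

With the values from the logs above, frame 3 of 1_bug.log matches frame 5 of 2_bug.log even when the objects are listed in a different order, while frame 3 of the two runs does not match.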


source.tar.xz (12.8 MB)

Is there any progress?

Sorry! Still checking… will get back to you next week.


Hello. Is there any progress?

Hi @v.burachonak ,
Sorry for the long delay! I can reproduce this, but I am still debugging it.


Hi @v.burachonak
Sorry for the long delay! We are still debugging this. Is this issue still important for you?

Hi. This issue is important for several of my projects. I can't find any workaround for this bug :(

There was a bug related to partial data corruption in DS-gRPC, and it will be fixed in the next release. Until then, if you are running tritonserver and the client on the same machine, please try the DS-Triton-CAPI approach, which has been fully tested.


Thank you
