Bad results running uff classifier (mobilenet) with deepstream

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Jetson TX2
• DeepStream Version
5.0-20.07
• JetPack Version (valid for Jetson only)
4.4 [L4T 32.4.3]
• TensorRT Version
7.1.3.0

I converted a TensorFlow MobileNet network to a UFF model using the following procedure:

  1. Create a TensorRT-compatible TensorFlow graph_def using the tf_trt_models code.

  2. Convert it to UFF using the code:

    import uff

    # graph_def and output_names come from step 1
    _ = uff.from_tensorflow(
        graph_def,
        output_nodes=output_names,
        output_filename="mobilenet.uff",
        text=True,
        debug_mode=True,
    )

  3. Create engine (.bin) file using the code:

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)

    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network() as network, \
            trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 28  # 256 MiB
        builder.max_batch_size = 1
        builder.fp16_mode = True

        # Input is CHW; outputs are the node names passed to the UFF converter
        parser.register_input("input", (3, 224, 224))
        for output_name in output_names:
            print(f"Registered output {output_name}")
            parser.register_output(output_name)
        parser.parse("mobilenet.uff", network)
        engine = builder.build_cuda_engine(network)

        # Serialize the engine so it can be loaded later
        buf = engine.serialize()
        with open("mobilenet.bin", "wb") as f:
            f.write(buf)

Then I tested the engine file (mobilenet.bin) with this Python code:

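import numpy as np
import pycuda.autoinit  # noqa: F401 (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image


class TrtMobilenet(object):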
    def _load_engine(self):
        with open(self.model_path, "rb") as f, trt.Runtime(self.trt_logger) as runtime:
            return runtime.deserialize_cuda_engine(f.read())

    def _create_context(self):
        for binding in self.engine:
            size = (
                trt.volume(self.engine.get_binding_shape(binding))
                * self.engine.max_batch_size
            )
            host_mem = cuda.pagelocked_empty(size, np.float32)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(cuda_mem))
            if self.engine.binding_is_input(binding):
                self.host_inputs.append(host_mem)
                self.cuda_inputs.append(cuda_mem)
            else:
                self.host_outputs.append(host_mem)
                self.cuda_outputs.append(cuda_mem)
        return self.engine.create_execution_context()

    def __init__(self, model_path, input_shape):
        """Initialize TensorRT plugins, engine and context."""
        self.model_path = model_path
        self.input_shape = input_shape
        self.trt_logger = trt.Logger(trt.Logger.INFO)
        self.engine = self._load_engine()

        self.host_inputs = []
        self.cuda_inputs = []
        self.host_outputs = []
        self.cuda_outputs = []
        self.bindings = []
        self.stream = cuda.Stream()
        self.context = self._create_context()

    def __del__(self):
        """Free CUDA memories."""
        del self.stream
        del self.cuda_outputs
        del self.cuda_inputs

    def read(self, path):
        """Read and resize image."""
        img = Image.open(path).resize(self.input_shape)
        return np.asarray(img)

    def preprocess(self, img):
        img = img.transpose((2, 0, 1)).astype(np.float32)
        # No normalization needed
        # img *= 2.0 / 255.0
        # img -= 1.0
        return img

    def detect(self, path):
        """Detect objects in the input image."""
        img_resized = self.read(path)
        img_resized = self.preprocess(img_resized)
        np.copyto(self.host_inputs[0], img_resized.ravel())

        cuda.memcpy_htod_async(self.cuda_inputs[0], self.host_inputs[0], self.stream)
        self.context.execute_async(
            batch_size=1, bindings=self.bindings, stream_handle=self.stream.handle
        )
        cuda.memcpy_dtoh_async(self.host_outputs[0], self.cuda_outputs[0], self.stream)
        self.stream.synchronize()

        output = self.host_outputs[0]

        return img_resized, output

model = TrtMobilenet("mobilenet.bin", (224, 224))
img, scores = model.detect("frame.jpg")

It works as expected, returning the exact same results as the original tensorflow model.

Finally, I have integrated this model into DeepStream using the following pipeline:

gst-launch-1.0 multifilesrc location=${images} caps="image/jpeg,framerate=1/1" ! \
  jpegparse ! \
  nvv4l2decoder ! \
  nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=(string)NV12' ! \
  mux.sink_0 nvstreammux live-source=0 name=mux batch-size=1 width=224 height=224 ! \
  nvinfer config-file-path=mobilenet.txt batch-size=1 process-mode=1 ! \
  nvstreamdemux name=demux demux.src_0 ! \
  nvvideoconvert ! \
  nvdsosd ! \
  nvvideoconvert ! \
  nvv4l2h265enc ! \
  h265parse ! \
  qtmux ! \
  filesink location=detections.mp4

and its corresponding mobilenet.txt configuration file:

[property]
gpu-id=0
net-scale-factor=1.0
uff-file=mobilenet.uff
model-engine-file=mobilenet.bin
input-dims=3;224;224;0
uff-input-blob-name=input
output-blob-names=scores
labelfile-path=labels.txt
num-detected-classes=2
batch-size=2
model-color-format=1
network-mode=2
is-classifier=1
process-mode=1
classifier-async-mode=0
classifier-threshold=0.
operate-on-gie-id=1
gie-unique-id=4
#parse-classifier-func-name=NvDsInferClassiferParseCustomSoftmax
#custom-lib-path=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infercustomparser.so

The results (softmax probabilities) are different and wrong. What am I doing wrong?


Hi,

It seems that the color formats of the image and the network differ.

The configuration file sets the network color format to BGR (model-color-format=1).
https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html

Property:        model-color-format
Meaning:         Color format required by the model
Type and Range:  Integer (0: RGB, 1: BGR, 2: GRAY)
Example:         model-color-format=0

But the input image format is NV12.

Would you mind changing the output color format of nvvideoconvert to BGRx and trying again?
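
To illustrate why the channel order matters, here is a minimal sketch. The helper name swap_red_blue is hypothetical and not part of DeepStream. If the model expects one order and receives the other, the red and blue values are exchanged for every pixel:

```python
def swap_red_blue(pixels):
    """Convert a list of 3-channel pixels between RGB and BGR order."""
    return [(c2, c1, c0) for (c0, c1, c2) in pixels]


# A pure red pixel in RGB order, followed by a mid-gray pixel.
rgb_pixels = [(255, 0, 0), (128, 128, 128)]
bgr_pixels = swap_red_blue(rgb_pixels)

# Read as RGB, the first BGR pixel now looks pure blue; gray is unaffected.
print(bgr_pixels)
```

Images with little color content are barely affected, so a mismatch of this kind can degrade results without breaking them completely.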

Thanks.

Hi,

Yes, actually I made a typo: the model expects RGB, so I have set model-color-format=0. Following your advice, I have also changed the nvvideoconvert output format to RGBA, so the pipeline now looks like this:

  gst-launch-1.0 multifilesrc location=${images} caps="image/jpeg,framerate=1/1" ! \
  jpegparse ! \
  nvv4l2decoder ! \
  nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=(string)RGBA' ! \
  mux.sink_0 nvstreammux live-source=0 name=mux batch-size=1 width=224 height=224 ! \
  nvinfer config-file-path=/usr/share/pt/data/classifiers/classifier.txt batch-size=1 process-mode=1 ! \
  nvstreamdemux name=demux demux.src_0 ! \
  nvvideoconvert ! \
  nvdsosd ! \
  nvvideoconvert ! \
  nvv4l2h265enc ! \
  h265parse ! \
  qtmux ! \
  filesink location=detections.mp4

The results are better, but they are still not the same. These are the softmax probabilities I get for 21 JPG images (frame_00 to frame_20), for both the Python code and DeepStream:

image         python class 0  python class 1  DeepStream class 0  DeepStream class 1
frame_00.jpg  1               0               0.999512            0.000725
frame_01.jpg  0.847167        0.152832        0.486816            0.513184
frame_02.jpg  0.998046        0.001985        0.999023            0.001146
frame_03.jpg  0.998535        0.001543        0.999512            0.000625
frame_04.jpg  0.995117        0.004680        0.999512            0.000708
frame_05.jpg  1               0               1                   0
frame_06.jpg  0.625976        0.373779        0.343750            0.656250
frame_07.jpg  0.997558        0.002470        0.998047            0.001932
frame_08.jpg  0.985839        0.013938        0.985840            0.014206
frame_09.jpg  1               0               0.992676            0.007107
frame_10.jpg  0.870605        0.129638        0.590820            0.409180
frame_11.jpg  0.961425        0.038818        0.695801            0.303955
frame_12.jpg  0.969238        0.030563        0.912598            0.087280
frame_13.jpg  0.999023        0.001027        0.999023            0.001186
frame_14.jpg  0.500488        0.499511        0.367920            0.631836
frame_15.jpg  0.018646        0.981445        0.023148            0.977051
frame_16.jpg  0.019958        0.979980        0.011093            0.988770
frame_17.jpg  0.020172        0.979980        0.024063            0.976074
frame_18.jpg  0.036224        0.963867        0.010635            0.989258
frame_19.jpg  0.097412        0.902832        0.036041            0.963867
frame_20.jpg  0.180419        0.819824        0.088318            0.911621

As long as the model is confident (probabilities close to 1 and 0), the DeepStream and Python results are quite similar. In the middle range, however, the probabilities differ substantially.
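
This pattern is what a small difference in the inputs would produce. For a two-class softmax, the class 0 probability is a sigmoid of the logit difference, so the same shift in the logits moves a probability near 0.5 much more than one near 0 or 1. A minimal sketch, assuming a hypothetical logit shift of -0.5 (an illustrative value, not something I measured):

```python
import math


def two_class_softmax(logit_diff):
    """Probability of class 0 given logit_diff = z0 - z1."""
    return 1.0 / (1.0 + math.exp(-logit_diff))


def shift_effect(p, delta):
    """Change in class-0 probability when the logit difference moves by delta."""
    logit_diff = math.log(p / (1.0 - p))  # invert the sigmoid
    return two_class_softmax(logit_diff + delta) - p


# Hypothetical perturbation of the logit difference (illustrative only).
DELTA = -0.5

for p in (0.999, 0.85, 0.5):
    print(f"p={p:.3f} -> change {shift_effect(p, DELTA):+.4f}")
```

So a mismatch that is invisible on confident frames can still flip the decision on uncertain ones.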

I do not know, maybe there are still small differences between the image arrays fed into the python model engine and the DeepStream model engine.

Do you think it is related to differences in JPEG decoding between Pillow and jpegparse ! nvv4l2decoder?
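
One way two decoders could disagree is the YCbCr-to-RGB conversion. JPEG files (JFIF) use full-range BT.601, while video pipelines often assume limited-range BT.601. I do not know whether that is what happens here. The sketch below only shows that the two conventions give different RGB values for the same pixel, and the pixel value is arbitrary:

```python
def clip8(value):
    """Round and clamp a value to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb_full(y, cb, cr):
    """BT.601 full-range conversion (the JFIF convention for JPEG files)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clip8(r), clip8(g), clip8(b)


def ycbcr_to_rgb_limited(y, cb, cr):
    """BT.601 limited-range conversion (common for video: Y in [16, 235])."""
    c = 1.164 * (y - 16)
    r = c + 1.596 * (cr - 128)
    g = c - 0.392 * (cb - 128) - 0.813 * (cr - 128)
    b = c + 2.017 * (cb - 128)
    return clip8(r), clip8(g), clip8(b)


# Arbitrary example pixel (Y, Cb, Cr); not taken from the test images.
pixel = (120, 110, 150)
print("full range:   ", ycbcr_to_rgb_full(*pixel))
print("limited range:", ycbcr_to_rgb_limited(*pixel))
```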

Do you think there is something missing in the normalization? I have also tried to set net-scale-factor=1.0 and offsets=0;0;0.

Thank you in advance.


Hi,

This can happen if the preprocessing stages in DeepStream and Python differ.

Would you mind sharing a reproducible source with us?
We would like to reproduce the issue and investigate it before making further suggestions.

Thanks.

Hi,

I can share a reproducible source (a Docker image) with you privately.

Thank you.

You can try changing the net-scale-factor. For an SSD MobileNet v2 that I trained with the TensorFlow Object Detection API, I use net-scale-factor=0.03, which gives the same detections in DeepStream as on the PC.

Yes, thank you. I just tried it, and the results were different and bad :(

Actually, I also use SSD MobileNet v1 (TensorFlow Object Detection API) with net-scale-factor=0.0078431372 (which corresponds to 2/255) and offsets=1;1;1, and it works great for me. The classification model I am trying to fix here is also a MobileNet v1, and I have tried many combinations of net-scale-factor and offsets with no success.
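
For reference, as I understand the Gst-nvinfer documentation, each pixel is transformed as y = net-scale-factor * (x - offset). A minimal sketch of the input range this gives for a couple of settings (the function name nvinfer_preprocess is mine, not a DeepStream API):

```python
def nvinfer_preprocess(x, net_scale_factor, offset):
    """Per-pixel transform documented for Gst-nvinfer: y = scale * (x - offset)."""
    return net_scale_factor * (x - offset)


SCALE = 2.0 / 255.0  # 0.0078431372...

for offset in (1.0, 127.5):
    lo = nvinfer_preprocess(0.0, SCALE, offset)
    hi = nvinfer_preprocess(255.0, SCALE, offset)
    print(f"offset={offset}: [0, 255] -> [{lo:.4f}, {hi:.4f}]")
```

This only shows the resulting input range; it says nothing about which range a given model actually needs.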

Moreover, I think I do not need any preprocessing before feeding the RGB image into the network, since the required preprocessing is performed inside the tensorrt graph that I build using the tf_trt_models code.

For SSD networks (created with Tensorflow Object Detection API) , I think we need to specify the net-scale-factor and offset parameters because of this piece of code we usually use to generate the model engines:

    namespace_plugin_map = {
        ....
        "Preprocessor": Input,
        "ToFloat": Input,
        "image_tensor": Input,
        ....
    }
    graph.collapse_namespaces(namespace_plugin_map)

On the other hand, for my classification mobilenet I do not collapse any namespace. However, I still think that there is something wrong with the preprocessing.

Do you think it is OK to use a float32 input placeholder? Maybe DeepStream expects a uint8 placeholder? I use float32, just like tf_trt_models does:

tf_input = tf.placeholder(tf.float32, [None, net.input_height, net.input_width, 3], name=input_name)

Thank you, any help is appreciated.

Hi,

We have received the data via private message.
We will try to reproduce the issue first.

Thanks.

Hi,

We have not received a response to our private message about the reproducible source.
Could you please check the message and share the data with us?

Thanks.

Hi,

Yes, sorry for my late response. You will have it in a couple of hours. I need to anonymize the test data.

Thanks.

Hi,

I sent the data in the private message. Could you reproduce my issue?

Thanks in advance.

Hi,

Thanks for your help.

We can reproduce this issue in our environment
and have passed it on to our internal team.

We will update you once we get any feedback.

Thanks.

Hi,

Thanks for your patience.

Here is an update on this issue:
The issue comes from the JPEG decoding and the color conversion to RGBA.
We are still working on the fix. I will keep you updated once we make progress.

Thanks.

Is the JPEG decoding and the color conversion to RGBA fixed? I am having the same issues with accuracy.

Hi,

Not yet.
We will post an update here once there is progress.

Thanks.