Very odd results when running inference on a DIGITS TF model on Jetson TX2 with JetPack 3.3

Hello,

I created a simple network for image classification (LeNet), trained on the MNIST data set. I just added a softmax, but that does not change the classification. When I classify one image (say, an image of the digit zero, $PATH/mnist/train/0/08284.png) in DIGITS, I get correct predictions (0 - 23%, the rest under 9%):

Class number:0 Prediction:23,17%
Class number:5 Prediction:8,54%
Class number:6 Prediction:8,54%
Class number:8 Prediction:8,54%
Class number:9 Prediction:8,54%
and the rest... (copied from the DIGITS Classify One Image page)

The accuracy of the whole network is about 98%. So I downloaded the model and exported it to UFF format so that I could run my own inference. When I build and run the project “sample_uff_mnist” on the Jetson (from the TensorRT 4.0.1.6 sources), modified so it uses my exported UFF model and my own input image (08284.png), the results are pretty bad:

Class number:0 Prediction:9,407872%
Class number:1 Prediction:9,695744%
Class number:2 Prediction:10,561737%
Class number:3 Prediction:10,322227%
Class number:4 Prediction:9,861691%
Class number:5 Prediction:10,206078%
Class number:6 Prediction:9,662967%
Class number:7 Prediction:9,109143%
Class number:8 Prediction:11,636149%
Class number:9 Prediction:9,536383%

When I use the lenet.uff model from the TensorRT sources (TensorRT-4.0.1.6/data/mnist/lenet.uff), it predicts correctly, so apparently there is some problem/preprocessing on the DIGITS side that I cannot figure out.

Is there something I am missing? All pictures are grayscale, so there is no preprocessing other than converting the images to float. I am NOT using any mean (neither mean image nor mean pixel); the network was trained with Mean set to None. The images are 28x28, so no resize is needed either.

I am running the latest DIGITS container (nvcr.io/nvidia/digits 18.10) on a Jetson TX2 running JetPack 3.3. TensorRT is 4.0.1.6, and I am using the C++ API.

Btw, I get the same results on an x86_64 desktop with a GTX 1070, so it is not Jetson dependent.

Network.py from DIGITS:

from model import Tower
from utils import model_property
import tensorflow as tf
import tensorflow.contrib.slim as slim
import utils as digits

class UserModel(Tower):

    @model_property
    def inference(self):
        x = tf.reshape(self.x, shape=[-1, self.input_shape[0], self.input_shape[1], self.input_shape[2]], name="input_node")
        # scale (divide by MNIST std)
        x = x * 0.0125
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            weights_initializer=tf.contrib.layers.xavier_initializer(),
                            weights_regularizer=slim.l2_regularizer(0.0005)):
            model = slim.conv2d(x, 20, [5, 5], padding='VALID', scope='conv1')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool1')
            model = slim.conv2d(model, 50, [5, 5], padding='VALID', scope='conv2')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool2')
            model = slim.flatten(model)
            model = slim.fully_connected(model, 500, scope='fc1')
            model = slim.dropout(model, 0.5, is_training=self.is_training, scope='do1')
            model = slim.fully_connected(model, self.nclasses, activation_fn=None, scope='fc2')
            
            model = tf.nn.softmax(model, name="output_node")
            
            return model

    @model_property
    def loss(self):
        model = self.inference
        loss = digits.classification_loss(model, self.y)
        accuracy = digits.classification_accuracy(model, self.y)
        self.summaries.append(tf.summary.scalar(accuracy.op.name, accuracy))
        return loss

Hi,

The most common cause is that the image is in a different value range.

SampleUffMNIST reads the image into [0, 255].
Could you check whether the DIGITS source uses [0, 255] or [0, 1]?
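
For example, a quick sanity check (just a sketch, assuming the C++ side loads the image into a single-channel float cv::Mat named testImage) is to print the minimum and maximum of the buffer that is actually fed to TensorRT:

// needs <opencv2/core.hpp> and <iostream>
double minVal = 0.0, maxVal = 0.0;
cv::minMaxLoc(testImage, &minVal, &maxVal);        // range of the preprocessed input
std::cout << "input range: [" << minVal << ", " << maxVal << "]" << std::endl;

If DIGITS feeds [0, 1], the C++ input would have to be scaled by 1/255 to match.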

Thanks.

I am unable to find any relevant code that normalizes the image into the range [0, 1].

At https://github.com/NVIDIA/DIGITS/blob/master/digits/utils/image.py#L114

elif channels == 1:
    if image.ndim != 2:
        if image.ndim == 3 and image.shape[2] in [3, 4]:
            # color to grayscale. throw away alpha
            image = np.dot(image[:, :, :3], [0.299, 0.587, 0.114]).astype(np.uint8)
        else:
            raise ValueError('invalid image shape: %s' % (image.shape,))

which converts the image to grayscale with coefficients 0.299, 0.587, 0.114 (this does not matter, since in a grayscale image a (155, 155, 155) pixel becomes a (155) pixel).

Here https://github.com/NVIDIA/DIGITS/blob/master/digits/tools/tensorflow/tf_data.py#L278

the image is loaded, the mean is subtracted (if enabled), the image is cropped (if enabled), augmented (if enabled), and assembled into a batch. Inference is launched with this batch.

The image is later cast to a float array: https://github.com/NVIDIA/DIGITS/blob/master/digits/model/tasks/tensorflow_train.py#L513

So I removed my normalization into the range [0, 1].

The steps I am performing now:

cv::Mat testImage = cv::imread("/tmp/MNIST/train/0/08284.png");

testImage.convertTo(testImage, CV_32FC3);
// Does not matter if BGR or RGB, the image is grayscale so pixel R == G == B
cv::cvtColor(testImage, testImage, cv::COLOR_BGR2GRAY);   // result: 28x28 CV_32FC1

Which I copy to GPU memory like this:

cudaMemcpy(inputOutputBuffer[0], testImage.data, 28 * 28 * 4, cudaMemcpyKind::cudaMemcpyHostToDevice);   // 28*28 floats = 3136 bytes

And then running (mTRTExecutionContext is an nvinfer1::IExecutionContext*):

mTRTExecutionContext->execute(1, &inputOutputBuffer[0]);
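
For completeness, the device buffers are the usual TensorRT bindings, roughly like this (a sketch; mTRTEngine is just a placeholder name for the nvinfer1::ICudaEngine*, and the binding names depend on the exported graph):

void* inputOutputBuffer[2];                                                     // device pointers, one per binding
const int inputIndex  = mTRTEngine->getBindingIndex("val/model/input_node");    // mTRTEngine: placeholder for the ICudaEngine*
const int outputIndex = mTRTEngine->getBindingIndex("val/model/output_node");

cudaMalloc(&inputOutputBuffer[inputIndex],  28 * 28 * sizeof(float));           // 1x28x28 float input
cudaMalloc(&inputOutputBuffer[outputIndex], 10 * sizeof(float));                // 10 class scores

// after execute(): copy the class scores back to the host
float results[10];
cudaMemcpy(results, inputOutputBuffer[outputIndex], 10 * sizeof(float), cudaMemcpyDeviceToHost);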

The results without normalizing into the range [0, 1] are:

Class number:0 Prediction:1,698807%
Class number:1 Prediction:53,010906%
Class number:2 Prediction:1,472344%
Class number:3 Prediction:0,236940%
Class number:4 Prediction:1,499199%
Class number:5 Prediction:34,592514%
Class number:6 Prediction:0,710143%
Class number:7 Prediction:3,224902%
Class number:8 Prediction:3,441595%
Class number:9 Prediction:0,112647%

Which is still wrong.

In Python the TF network works as intended, but I would like to use the C++ API instead.

It simply loads the image with imread into a variable image and feeds that variable to the input_node of the network.

classes = self.tf_session.run(self.out_tensor, feed_dict={"val/model/input_node:0": image[None, ...]})

I went through all the preprocessing in the DIGITS source code, and it does use the range 0-255.

I must be missing something that I cannot find. Is there any example of running inference on a TF model trained by DIGITS in C++ (TensorRT)?

I cannot figure out how the network from UffSampleMNIST (lenet5.uff) differs from the TF LeNet network trained by DIGITS on the MNIST dataset.

I found out that the input node from DIGITS has

nodes {
    id: "val/model/input_node"
    operation: "Input"
    fields {
      key: "dtype"
      value {
        dtype: DT_FLOAT32
      }
    }
    fields {
      key: "shape"
      value {
        i_list {
        }
      }
    }
  }

but lenet5.uff from UffSampleMNIST has

nodes {
    id: "Input_0"
    operation: "Input"
    fields {
      key: "dtype"
      value {
        dtype: DT_FLOAT32
      }
    }
    fields {
      key: "shape"
      value {
        i_list {
          l: 28
          l: 28
          l: 1
        }
      }
    }
  }
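
Since the shape field of the DIGITS input node is empty, the input dimensions have to be supplied explicitly when registering the input with the C++ UFF parser, along these lines (just a sketch, assuming the 1x28x28 grayscale input in NCHW order):

mUffParser->registerInput("val/model/input_node", nvinfer1::DimsCHW(1, 28, 28), nvuffparser::UffInputOrder::kNCHW);
mUffParser->registerOutput("val/model/output_node");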

Hi,

Pillow and OpenCV use different x, y coordinate conventions.
Could this cause an issue?

If not, could you share a simple .pb file for us to check?
Thanks.

Thank you for helping me.

Yes, I am aware that PIL coordinates and OpenCV coordinates differ, but sadly that is not the issue.

Of course. It is a model trained by DIGITS (TensorFlow); the only changes are using no mean and grayscale images. I just assigned names to the input and output tensors, which are now:

Input tensor : “val/model/input_node”
Output tensor : “val/model/output_node”

And I added a softmax as the output node, but you can use “val/model/fc2/BiasAdd” for the raw DIGITS output.

I attached a test image. The correct predictions are in the first post of this topic.

Here is the whole model from DIGITS + the test image: TFModel-NvidiaDigits-NoMean-MNIST.tar (Google Drive)

When I use lenet5.uff from UffSampleMNIST in my C++ code (and convert 0-255 to 0-1), it works exactly the same as in UffSampleMNIST.
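
(The 0-255 to 0-1 conversion is just a scale factor applied during the float conversion; for reference, one way to do it on the single-channel OpenCV image is simply:)

testImage.convertTo(testImage, CV_32FC1, 1.0 / 255.0);   // scale [0, 255] -> [0, 1]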

Hi,

Thanks for sharing your model with us.
We will check this issue internally and get back to you with more information later.

Hi,

It looks like the main difference from the LeNet model is the output node.
Could you try using “val/model/fc2/BiasAdd” as the output node rather than “val/model/output_node”?

Or update this layer:

model = slim.fully_connected(model, self.nclasses, activation_fn=None, scope='fc2', name="output_node")

Thanks.

I tried this already.

When I used it as the output node, the results were the following:

[2018-12-07 10:56:22.420] [UnitTest] [debug] Class number:0 Prediction:0,0169881
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:1 Prediction:0,530109
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:2 Prediction:0,0147234
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:3 Prediction:0,0023694
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:4 Prediction:0,014992
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:5 Prediction:0,345925
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:6 Prediction:0,00710143
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:7 Prediction:0,032249
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:8 Prediction:0,0344159
[2018-12-07 10:56:22.421] [UnitTest] [debug] Class number:9 Prediction:0,00112647

Which is the same.

So I tried to create a new network in DIGITS, but I realized that DIGITS adds a softmax to the network too. Also, slim.fully_connected does not allow passing name as a parameter.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L1769

Here is where DIGITS adds the softmax layer:

https://github.com/NVIDIA/DIGITS/blob/7a3d5f00f3ef0e81cdc3415b03c6ede98c3ef91c/digits/tools/tensorflow/main.py#L331

The previous results in this topic were multiplied by 100 to get %.

I am sorry, I apparently did not change the output node; I just removed the multiplication by 100.

I started over to see if I did something wrong…

I made a new network and dataset: MNIST, grayscale, 28x28, no mean.

Now I know that DIGITS adds a softmax layer, so I cannot add a softmax layer as in the first post; it would make the predictions worse. So I created the network like this (DIGITS won't be using my softmax layer; I will apply it myself in C++/Python):

from model import Tower
from utils import model_property
import tensorflow as tf
import tensorflow.contrib.slim as slim
import utils as digits

class UserModel(Tower):

    @model_property
    def inference(self):
        x = tf.reshape(self.x, shape=[-1, self.input_shape[0], self.input_shape[1], self.input_shape[2]], name='input_node')
        # scale (divide by MNIST std)
        x = x * 0.0125
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            weights_initializer=tf.contrib.layers.xavier_initializer(),
                            weights_regularizer=slim.l2_regularizer(0.0005)):
            model = slim.conv2d(x, 20, [5, 5], padding='VALID', scope='conv1')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool1')
            model = slim.conv2d(model, 50, [5, 5], padding='VALID', scope='conv2')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool2')
            model = slim.flatten(model)
            model = slim.fully_connected(model, 500, scope='fc1')
            model = slim.dropout(model, 0.5, is_training=self.is_training, scope='do1')
            model = slim.fully_connected(model, self.nclasses, activation_fn=None, scope='fc2')
            
            tf.nn.softmax(model, name='softmax_layer')

            return model

    @model_property
    def loss(self):
        model = self.inference
        loss = digits.classification_loss(model, self.y)
        accuracy = digits.classification_accuracy(model, self.y)
        self.summaries.append(tf.summary.scalar(accuracy.op.name, accuracy))
        return loss

The correct results from DIGITS are:

99.74% 0
0.26%  6
0.0%   5
0.0%   8
0.0%   9

I created a Python inference script and the results are (softmax_layer as the output node):

[ 0 = 9.9740511e-01
  1 = 1.8162309e-07 
  2 = 2.9434251e-08
  3 = 2.7119813e-09
  4 = 7.8573720e-08
  5 = 1.1273847e-05
  6 = 2.5816681e-03
  7 = 1.6479396e-09
  8 = 1.0915338e-06
  9 = 5.1449058e-07]

Run with val/model/fc2/BiasAdd as the output node:

[
0 = 14.050947   
1 = -1.4677866  
2 = -3.2875614  
3 = -5.67204    
4 = -2.3056836   
5 = 2.660521
6 = 8.094226   
7 = -6.170194    
8 = 0.32561916 
9 = -0.42654312]

Which is consistent with the softmax output (these are the raw scores before the softmax).

Python image preprocessing:

image = cv2.imread("/home/jakub/Prace/Digits/data/imageClass/mnist/train/0/08284.png")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

image = np.expand_dims(image, -1)   # HWC with a single channel: 28x28x1
image = image.astype(float)         # values stay in the range [0, 255]
self.tf_session.run(self.out_tensor, feed_dict={self.image_tensor: image[None, ...]})  # add batch dimension

Now the C++ side:

This is tricky and I cannot understand why it works like this… So I will split the following into three parts.

Converting PB > UFF

self.MODEL_GRAPH = tf.graph_util.remove_training_nodes(self.MODEL_GRAPH)

self.MODEL_GRAPH = optimize_for_inference_lib.optimize_for_inference(
    input_graph_def=self.MODEL_GRAPH,
    input_node_names=self.NODES_INPUT,
    output_node_names=self.NODES_OUTPUT,
    placeholder_type_enum=tf.float32.as_datatype_enum
)
  1. First Experiment

I ran the PB > UFF conversion with self.NODES_OUTPUT = val/model/fc2/BiasAdd.

C++ side #1:

mUffParser->registerOutput("val/model/fc2/BiasAdd")

Results:

[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:0 Prediction:4,68173
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:1 Prediction:2,7936
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:2 Prediction:2,91162
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:3 Prediction:-7,49797
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:4 Prediction:5,32086
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:5 Prediction:0,424044
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:6 Prediction:4,6629
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:7 Prediction:1,66149
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:8 Prediction:1,61851
[2018-12-10 11:50:16.322] [UnitTest] [debug] Class number:9 Prediction:-3,12583

These are bad results…

  2. Second Experiment

I ran the PB > UFF conversion with self.NODES_OUTPUT = val/model/softmax_layer.

C++ Side #1

mUffParser->registerOutput("val/model/softmax_layer")

Results:

[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:0 Prediction:0,23213
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:1 Prediction:0,0351339
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:2 Prediction:0,0395349
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:3 Prediction:1,19166e-06
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:4 Prediction:0,439845
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:5 Prediction:0,00328579
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:6 Prediction:0,2278
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:7 Prediction:0,0113255
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:8 Prediction:0,010849
[2018-12-10 11:58:09.672] [UnitTest] [debug] Class number:9 Prediction:9,43954e-05

C++ Side #2

mUffParser->registerOutput("val/model/fc2/BiasAdd")

Results:

[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:0 Prediction:0,2321
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:1 Prediction:0,0351339
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:2 Prediction:0,0395349
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:3 Prediction:1,19166e-06
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:4 Prediction:0,439845
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:5 Prediction:0,00328579
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:6 Prediction:0,2278
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:7 Prediction:0,0113255
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:8 Prediction:0,010849
[2018-12-10 11:55:24.996] [UnitTest] [debug] Class number:9 Prediction:9,43954e-05

Notice that the results are the same, so it does not depend on which output node I register with the UffParser.

C++ Side #3:

I removed the registerOutput call.

Results:

[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:0 Prediction:0,23213
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:1 Prediction:0,0351339
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:2 Prediction:0,0395349
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:3 Prediction:1,19166e-06
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:4 Prediction:0,439845
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:5 Prediction:0,00328579
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:6 Prediction:0,2278
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:7 Prediction:0,0113255
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:8 Prediction:0,010849
[2018-12-10 12:00:26.853] [UnitTest] [debug] Class number:9 Prediction:9,43954e-05

Same as the previous ones. What is the UffParser registerOutput function for, then?

I do not get why the results from Python are correct while C++ struggles to get any correct results.

Link for model : https://drive.google.com/open?id=1R5xC1vm8xPbU1KL8TcHiwhKYtNExyoBv

I made a new model that works with color images; the prediction in DIGITS is 99.78%.

So I did some research on data formats: apparently OpenCV uses NHWC, DIGITS uses NHWC, and TF uses NHWC, but TRT uses NCHW.

So I preprocessed the loaded image to get it into NCHW format:

QImage qTestImage;
qTestImage.load("/Digits/data/imageClass/mnist/train/0/08284.png");
qTestImage = qTestImage.scaled(28, 28);

float imageNCHW[28 * 28 * 3];

for (int y = 0; y < 28; y++)
{
    for (int x = 0; x < 28; x++)
    {
        const QRgb rgb  = qTestImage.pixel(y, x);
        const float3 px = make_float3(float(qBlue(rgb)),
                                      float(qGreen(rgb)),
                                      float(qRed(rgb)));

        // planar NCHW layout: all values of channel 0, then channel 1, then channel 2
        imageNCHW[(28 * 28) * 0 + y * 28 + x] = px.x;
        imageNCHW[(28 * 28) * 1 + y * 28 + x] = px.y;
        imageNCHW[(28 * 28) * 2 + y * 28 + x] = px.z;
    }
}

But the results are again wrong, even though the image is in the right format; the output node was val/model/fc2/BiasAdd.

It does not matter which output node I register via the C++ UffParser API, and it does not matter which UffInputOrder I select; the result is always the same…

Okay so…

  1. I did not find any use for the registerOutput function when running a UFF model.
  2. I did not find any use for the UffInputOrder parameter of the registerInput function; kNCHW and kNHWC make no difference.

Let's get back to the initial problem. I found the culprit!

It is a problem in TensorRT (TRT 4). The default DIGITS network uses slim. That in itself is not a problem, BUT the flatten operation is not properly supported/not available in TensorRT.

model = slim.flatten(model)

So the flatten operation is converted during the PB-to-UFF conversion into a reshape operation, but this conversion is not done correctly!

In the LeNet case I changed it to:

model = tf.reshape(model, [-1, 800])

After I replaced the flatten operation with tf.reshape, the network works in C++ as intended!

Results:

[2018-12-12 15:05:07.485] [UnitTest] [debug] -->0 = 2,79693e-08
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->1 = 4,20164e-06
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->2 = 1,08377e-08
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->3 = 0,963902
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->4 = 1,92564e-09
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->5 = 0,0360398
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->6 = 1,2714e-10
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->7 = 5,43027e-07
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->8 = 2,48788e-05
[2018-12-12 15:05:07.485] [UnitTest] [debug] -->9 = 2,86922e-05

DIGITS results:

96.39% 3
3.6%   5
0.0%   9
0.0%   8
0.0%   1

Cool! Thanks for sharing it with us.