Incorrect Results during Inference using TensorRT 3.0 C++ UFF Parser

Hi,

  1. Please remember to modify the output buffer size.

  2. You can output multiple blobs at the same time.

  3. For an output blob, also specify its name when converting the TensorFlow model to UFF:

uff_model = uff.from_tensorflow(tf_model, ['out1','out2','out3'])

Here is a sample for your reference:

parser.register_input("data", (xx, xx, xx), 0)
parser.register_output("out1")
parser.register_output("out2")
parser.register_output("out3")

....

dims_data = engine.get_binding_dimensions(0).to_DimsCHW()
dims_out1 = engine.get_binding_dimensions(1).to_DimsCHW()
dims_out2 = engine.get_binding_dimensions(2).to_DimsCHW()
dims_out3 = engine.get_binding_dimensions(3).to_DimsCHW()

...

d_data = cuda.mem_alloc(MAX_BATCHSIZE * dims_data.C() * dims_data.H() * dims_data.W() * _data.dtype.itemsize)
d_out1 = cuda.mem_alloc(MAX_BATCHSIZE * dims_out1.C() * dims_out1.H() * dims_out1.W() * _out1.dtype.itemsize)
d_out2 = cuda.mem_alloc(MAX_BATCHSIZE * dims_out2.C() * dims_out2.H() * dims_out2.W() * _out2.dtype.itemsize)
d_out3 = cuda.mem_alloc(MAX_BATCHSIZE * dims_out3.C() * dims_out3.H() * dims_out3.W() * _out3.dtype.itemsize)

...

bindings = [int(d_data), int(d_out1), int(d_out2), int(d_out3)]

...

context.enqueue(1, bindings, stream.handle, None)

...

cuda.memcpy_dtoh_async(_out1, d_out1, stream)
cuda.memcpy_dtoh_async(_out2, d_out2, stream)
cuda.memcpy_dtoh_async(_out3, d_out3, stream)

Thanks.

Hello,

Thanks for the response as always. I think I have gotten the comparison code working; however, I am still a bit confused. When I register some blobs as outputs while parsing from TensorFlow to UFF, I cannot access them. For example, I get an error when I try running the code with this configuration:

uff_model=uff.from_tensorflow(frozen_graph,output_filename="/raid/nri/Classification_task/TensorRt_text_files/lenet_uff",output_nodes=['bc1','conv1_bias','out'],text=True)

uff_model=open("/raid/nri/Classification_task/TensorRt_text_files/lenet_uff",'rb').read()

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)

parser = uffparser.create_uff_parser()
parser.register_input("in", (1,28,28),0)
parser.register_output("bc1")
parser.register_output("conv1_bias")
parser.register_output("out")

It gives me this error.

Traceback (most recent call last):
  File "/home/dami/TensorRt_test/CompareLayersLeNet.py", line 142, in <module>
    tf.app.run()
  File "/home/dami/tensorflow_NRI/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/dami/TensorRt_test/CompareLayersLeNet.py", line 109, in main
    d_out3 = cuda.mem_alloc(1 * dims_out3.C() * dims_out3.H() * dims_out3.W() * _out3.dtype.itemsize)
pycuda._driver.LogicError: cuMemAlloc failed: invalid argument

To me this looks like engine.get_binding_dimensions did not return anything for one of the registered output nodes. I noticed that if I registered layers without the “Const” operation, they would indeed work, and the values from TensorFlow and TensorRT were equivalent. I am guessing this is the expected behaviour? Just wanted to make sure.
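For reference, here is a quick way to list which bindings the engine actually created before allocating memory (just a sketch; I am assuming the Python bindings expose get_nb_bindings and get_binding_name the same way they expose get_binding_dimensions):

# Assumed API: get_nb_bindings/get_binding_name mirror the C++ getNbBindings/getBindingName
for i in range(engine.get_nb_bindings()):
    dims = engine.get_binding_dimensions(i).to_DimsCHW()
    print(i, engine.get_binding_name(i), dims.C(), dims.H(), dims.W())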

Following the idea of avoiding “Const” outputs, I implemented the same code for my MobileNet network. However, when I register and output some blobs, even ones without the “Const” operation, I still get a similar error to the one shown above. I managed to access the first convolution layer after my input node in mobilenet.uff, namely “MobilenetV1/MobilenetV1/Conv2d_0/convolution”. This layer outputs an array of the same size as the one from TensorFlow, but the values are different. Could this be what you were asking for? The next layer, “MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul_1”, which takes the first convolution layer as input, also gives different values from its TensorFlow equivalent. I mention this because in LeNet the conv1 layer outputs were different, yet from the conv1_bias layer to the end of the graph the values were the same.

In summary, I managed to compare a few layers between TensorFlow and TensorRT. However, in the code for LeNet I could only access layers without the “Const” operation; those layers were equivalent, as expected. When I wrote the same code for my MobileNet, I could not access many layers even when they did not have the “Const” operation, so I am not sure whether that is related to why I could access those layers in LeNet. The closest layer to my input node that I could access was “MobilenetV1/MobilenetV1/Conv2d_0/convolution”, but it gives different values than the array I got from TensorFlow. Attached are my mobilenet.uff and mobilenetuff.pbtxt in case they are needed. I am sorry this question is so long; I just want to make sure I get all the information to you. Thanks a lot as always.

Hi,

To investigate further, could you share the Python source of the model definition?
Thanks.

Hello,

I used the TensorFlow Slim library to train this MobileNet network. I will attach the mobilenet.py file that contains its model definition. Let me know if you require anything else.

Thanks as always
mobilenet_v1.zip (4.82 KB)

Hi,

Your model contains some depthwise convolution operations.
Could you check whether the difference comes from the depthwise convolution? A sketch of such a test follows the steps below.

1. Create a simple one-layer network:
data → depthwise-conv → output

2. Random input + all 1 weights

3. Check the output in TensorFlow and TensorRT

4. Check the sum of the outputs in TensorFlow and TensorRT
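For reference, the TensorFlow side of such a test could look roughly like this (a sketch only; the placeholder name, shapes and filter size are assumptions, not taken from your model):

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_util

# One-layer network: data -> depthwise-conv -> output, with all-1 weights
X = tf.placeholder(tf.float32, shape=[1, 28, 28, 3], name='in')
ones = np.ones([3, 3, 3, 1], dtype=np.float32)  # [H, W, in_channels, channel_multiplier]
net = tf.nn.depthwise_conv2d(X, ones, strides=[1, 1, 1, 1], padding='SAME')
out_name = net.op.name

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(1, 28, 28, 3).astype(np.float32)  # random input
    tf_out = sess.run(net, feed_dict={X: data})
    frozen = graph_util.convert_variables_to_constants(sess, sess.graph_def, [out_name])

# Convert 'frozen' with uff.from_tensorflow(frozen, [out_name]), register 'in' and
# out_name with the UFF parser, run it through TensorRT as in the sample above,
# and compare both the elements and np.sum() of the two outputs (steps 3 and 4).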

Thanks.

Hi,

Thanks for the response. I have not yet done what you requested, but I will get to it. A quick question, though: if I performed the same process for a normal conv layer, should it yield the same results, i.e. the output from TensorFlow would be equivalent to that of TensorRT?

Thanks.

Yes.

The result should be the same.

Hello,

Thanks for the response as always. I do not mean to “ignore” your suggestion to check the depthwise-conv layer, but I do not believe it is the root of the problem. I have another network, vgg_16, that I converted to the .uff format, and it is experiencing the same issues as MobileNet. The vgg_16 network does not have a depthwise-conv layer. I will share its Python source to show this.

Also, it seems that the outputs of the conv layers in all the networks I have checked are not the same between TensorFlow and TensorRT. Even for the LeNet network that works properly on TensorRT, whenever I compare its convolutional layers between TensorFlow and TensorRT, the results are different. Even though it works properly, shouldn't all layers be the same? My comparison method is the same as the one in the code pasted below.

I wrote code similar to what you asked me to do for the depthwise conv, but for a normal convolution layer instead, in particular slim.conv2d and even tf.nn.conv2d. In both cases the output was different between TensorFlow and TensorRT. My code is below:

#Implement simple 1-layer network tensorflow
#Will check slim.conv2d

import tensorrt as trt
import pycuda.driver as cuda
import tensorflow as tf
import numpy as np
from tensorflow.python.framework import graph_util
import uff
from tensorrt.parsers import uffparser

slim = tf.contrib.slim

RANDOM_SEED = 42
tf.set_random_seed(RANDOM_SEED)

def sumArray(array):
    holder=0
    for i in range(len(array)):
        holder=holder+array[i]
    return holder


def isclose(a, b, rel_tol=1e-05, abs_tol=0.00003):
    return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

def compare_arrays(array1,array2):
    if(len(array1)!=len(array2)):
        return False
    for i in range(len(array1)):
        status=isclose(array1[i],array2[i])
        if(status==False):
            return False
    return True


def init_weights(shape):
    """ Weight initialization """
    weights = np.ones(shape)

    #native
    #weights=tf.ones(shape)
    return weights

def get_data():
    inputs=np.random.rand(1,28,28,3)
    return inputs

def forward_prop(inputs):
    #native
    ones=init_weights([3,3,3,32])
    net=tf.nn.conv2d(inputs,ones,strides=[1,2,2,1],padding='SAME')
    return net


    #tf-slim
    # ones=init_weights([3,3,32])
    # weights_init=tf.constant_initializer(ones)
    #
    # with slim.arg_scope([slim.conv2d], padding='VALID',weights_initializer=weights_init):
    #     net=slim.conv2d(inputs, 32,[3, 3],stride=2)
    #     return net

def main():
    train_X=get_data()

    tensorrt_input=train_X.reshape(3,28,28)

    tensorrt_input=tensorrt_input.astype(np.float32)
    X = tf.placeholder("float", shape=[1, 28, 28, 3])
    h_conv1=forward_prop(X)

    # saver = tf.train.Saver()
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)

    tf.train.write_graph(sess.graph_def, '.', 'hellotensor.pbtxt')

    final_result=sess.run(h_conv1,feed_dict={X:train_X})

    # print(final_result)

    #saver.save(sess, './hellotensor.ckpt')

    output_graph_name='./hellotensor.pb'
    output_node_names='Conv2D'

    output_graph_def = graph_util.convert_variables_to_constants(sess,sess.graph_def,output_node_names.split(","))
    output_graph_def = tf.graph_util.remove_training_nodes(output_graph_def)

    uff_model = uff.from_tensorflow(output_graph_def, output_nodes=['Conv2D'])
    dump = open('slimConv.uff', 'wb')
    dump.write(uff_model)
    dump.close()

    # with tf.gfile.GFile(output_graph_name, "wb") as f:
    #     f.write(output_graph_def.SerializeToString())

    uff_model = open("/home/dami/TensorRt_test/slimConv.uff", 'rb').read()
    G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)
    parser = uffparser.create_uff_parser()
    parser.register_input("Placeholder", (3, 28, 28), 0)
    parser.register_output("Conv2D")

    engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser, 1, 1 << 20)

    parser.destroy()

    runtime = trt.infer.create_infer_runtime(G_LOGGER)
    context = engine.create_execution_context()

    dims_data = engine.get_binding_dimensions(0).to_DimsCHW()
    dims_out1 = engine.get_binding_dimensions(1).to_DimsCHW()

    _out0 = np.empty(dims_data.C() * dims_data.H() * dims_data.W(), dtype=np.float32)
    _out1 = np.empty(dims_out1.C() * dims_out1.H() * dims_out1.W(), dtype=np.float32)

    d_out0 = cuda.mem_alloc(1 * dims_data.C() * dims_data.H() * dims_data.W() * _out0.dtype.itemsize)
    d_out1 = cuda.mem_alloc(1 * dims_out1.C() * dims_out1.H() * dims_out1.W() * _out1.dtype.itemsize)

    bindings = [int(d_out0), int(d_out1)]

    stream = cuda.Stream()

    # transfer input data to device
    cuda.memcpy_htod_async(d_out0, tensorrt_input, stream)
    # execute model
    context.enqueue(1, bindings, stream.handle, None)
    # transfer predictions back
    cuda.memcpy_dtoh_async(_out1, d_out1, stream)
    # synchronize threads
    stream.synchronize()

    # re_array=_out1.reshape((13, 13, 32))

    results = final_result
    if (_out1.shape != final_result.shape):
        results = final_result.reshape(_out1.shape)

    print(str(compare_arrays(results, _out1)))
    print(sumArray(_out1))
    print(sumArray(results))

    context.destroy()
    engine.destroy()
    runtime.destroy()



if __name__ == '__main__':
    main()

Is it possible there is some issue with the conv layers? The output from this code is:

Converted 0 variables to const ops.
Using output node Conv2D
Converting to UFF graph
No. nodes: 4
False
80897.3109283
81799.3861542

As you can see, the outputs are different and the sums of the outputs are different as well. Am I doing something wrong? Is there a certain amount of tolerance I need to allow? I am also very confused about how the conv layers of the LeNet network give different outputs yet the final results are correct. Sorry again for the long questions. I really appreciate all the help. Thanks!!
vgg.zip (2.27 KB)

Hi,

Sorry for the late reply.

It looks like the inputs to TensorRT and TensorFlow are different.
They should be related by a transpose rather than a reshape.

Could you apply the following change and give it one more try?

def get_data():
    inputs=np.random.rand(3,28,28)
...
def forward_prop(inputs):
...
    net=tf.nn.conv2d(inputs,ones,strides=[1,2,2,1],data_format='NCHW')
...
def main():
    train_X=get_data()

    tensorrt_input=train_X

    tensorrt_input=tensorrt_input.astype(np.float32)
    X = tf.placeholder("float", shape=[1, 3, 28, 28])
    h_conv1=forward_prop(X)
...
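To make the relation concrete for the original NHWC array: reshape only reinterprets the same bytes, while transpose actually moves the channel axis. A minimal sketch using the train_X array from your script:

tensorrt_input = np.ascontiguousarray(train_X.transpose(0, 3, 1, 2), dtype=np.float32)  # NHWC (1,28,28,3) -> NCHW (1,3,28,28)
# train_X.reshape(3, 28, 28) keeps the original byte order, so the pixel layout gets scrambled.

Note that if TensorFlow stays in NHWC, its output would also need the corresponding transpose before an element-wise comparison with the CHW output from TensorRT, which is why generating the data directly in NCHW as above is simpler.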

Thanks

Hello,

Thank you so much. To be honest, this was the source of all my issues with the Python API. I can't believe it was such a rookie mistake on my part. I will get working with the C++ API now. Again, thanks so much!

Good to know this. : )

Hello,

Thanks so much for all the help you gave me. I managed to perform inference successfully using my VGG network and the C++ API on the Jetson. However, my MobileNet network still produces bad results, as I stated in #1 of this topic. What will you need from me to help with this? The MobileNet network works with the Python API as well.

Hi,

Sorry for the late reply.
It’s not easy to dig out the buggy layer of a deep model.

Could you share the details of the issue you mentioned in #30?
Could you also reproduce the result in #28?

Thanks.

Hi, Could you please check this?

The code below contains a very simple, basic network (2 conv + 1 fc) and its conversion to UFF. However, I am getting different results between TensorFlow and TensorRT. Where am I going wrong?

For information: unlike the MNIST example, in this use case the input has 3 channels (just like a usual color image). Should I transpose the input somehow (e.g., to NCHW)?

from __future__ import print_function
import numpy as np
import tensorflow as tf
import tensorrt as trt
from tensorrt.parsers import uffparser
import uff
import pycuda.driver as cuda
import pycuda.autoinit
slim = tf.contrib.slim

# ---- define network
with tf.Graph().as_default():
    image_placeholder = tf.placeholder(tf.float32, [None, 100, 100, 3])
    net = slim.repeat(image_placeholder, 2, slim.conv2d, 64, [3, 3], scope='conv1')
    net = slim.flatten(net)
    net = slim.fully_connected(net, 5, scope='pred/label_fc1')
    init = tf.global_variables_initializer()

    sess = tf.Session()
    sess.run(init)

    output_names = [net.op.name]     # which is "pred/label_fc1/Relu"
    graphdef = tf.get_default_graph().as_graph_def()
    frozen_graph = tf.graph_util.convert_variables_to_constants(sess, graphdef, output_names)
    frozen_graph = tf.graph_util.remove_training_nodes(frozen_graph)

# ---- model to uff
uff_model = uff.from_tensorflow(frozen_graph, output_names)
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)
parser = uffparser.create_uff_parser()
parser.register_input("Placeholder", (3,100,100), 0)
for output_name in output_names:
    print('register output:', output_name)
    parser.register_output(output_name)
engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser, 1, 1 << 20)
parser.destroy()

# ---- tensorflow inference
temp = np.random.rand(1, 100, 100, 3).astype(np.float32)  # random input
tf_results = sess.run(net, feed_dict={image_placeholder:temp})

# ---- tensorRT inference
runtime = trt.infer.create_infer_runtime(G_LOGGER)
context = engine.create_execution_context()
tr_result = np.empty(5, dtype=np.float32)

d_input = cuda.mem_alloc(1 * temp.size * temp.dtype.itemsize)
d_labels = cuda.mem_alloc(1 * tr_result.size * tr_result.dtype.itemsize)

bindings = [int(d_input), int(d_labels)]
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, temp, stream)
context.enqueue(1, bindings, stream.handle, None)
cuda.memcpy_dtoh_async(tr_result, d_labels, stream)
stream.synchronize()

# ---- let's see
print("tensorflow result: ", tf_results[0])
print("tensorRT result: ", tr_result)

Result:

tensorflow result:  [0.         0.         0.         0.09752206 0.        ]
tensorRT result:  [0.         0.         0.00887266 0.11281744 0.        ]

Hi,

Could you reformat the TensorRT input data from NHWC to NCHW and give it a try?
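For example (a minimal sketch using the temp array from the code above; the TensorFlow feed can stay NHWC):

trt_input = np.ascontiguousarray(temp.transpose(0, 3, 1, 2))  # (1,100,100,3) NHWC -> (1,3,100,100) NCHW
cuda.memcpy_htod_async(d_input, trt_input, stream)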
Thanks.

Hello,

Sorry for the very late response. I have been very busy lately.

Issue number 30 was because I was reshaping the input instead of transposing it. Hence the Python API wasn't working because I was giving it the wrong preprocessed array.

The error in #28 was also caused by this problem. Right now, however, even though the MobileNet network works perfectly with Python, it fails when I use the C++ API. I hope I'm being clear; please let me know, thanks! Also, is it possible for a layer to work properly with the Python API and then misbehave in C++? That might be the case here.

Hi,

The Python API is a wrapper around the TensorRT C++ library.
As a result, instead of checking the accuracy of the C++ layers, it's recommended to first compare the inputs given to the Python API and to the C++ interface.
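For example, you could dump the exact host buffer that the Python script copies to the GPU and compare it byte-for-byte with the buffer your C++ code passes to cudaMemcpy (a sketch; tensorrt_input stands for whatever host array your script feeds to memcpy_htod_async):

tensorrt_input.astype(np.float32).tofile('trt_input_python.bin')  # raw float32 dump, no header

Writing the C++ input to another file at the same point and diffing the two files will quickly show whether the preprocessing matches.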

Thanks.

Hello,

Thanks for the reply. I am certain that both the Python API and the C++ interface are receiving the same input, which is why I am confused as to why my MobileNet is not working.

Also, I noticed an error when converting one of my other networks, “InceptionV4”. It worked perfectly with the Python API, but in C++ it gave the error “Unsupported operation Flatten”. I managed to work around it by changing flatten to reshape, roughly as sketched below. But isn't flatten supported by TensorRT 3.0? Just wanted to bring it to your attention.
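The workaround was along these lines (a sketch; the actual tensor shape in my graph differs):

# Instead of: net = slim.flatten(net)
shape = net.get_shape().as_list()  # e.g. [None, 7, 7, 1024] (hypothetical)
net = tf.reshape(net, [-1, shape[1] * shape[2] * shape[3]])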

I tried the code in #34 with the TensorRT input in NCHW, but the result is still not the same as TensorFlow's.

Hi,

Squeeze and flatten operations are only available starting from TensorRT 3.0 GA (libnvinfer 4.0.1).
Jetson currently ships with TensorRT 3.0 RC (libnvinfer 4.0.0), which doesn't support these ops yet.

Thanks.