I don't get similar results with TensorRT and the trained tensorflow model!

Hi,

I don’t get similar results with TensorRT and the trained tensorflow model!

For example, for a batch of size 19, I get this from the trained TensorFlow model (3 classes):

[[6.0240232e-04 1.3105543e-03 9.9808705e-01]
 [3.6373714e-03 6.9858474e-03 9.8937678e-01]
 [2.9819757e-03 5.3826626e-03 9.9163532e-01]
 [4.3369378e-03 7.5148577e-03 9.8814827e-01]
 [8.7779770e-03 1.5823688e-02 9.7539830e-01]
 [4.9051787e-03 8.8414559e-03 9.8625344e-01]
 [7.6577030e-03 1.4652319e-02 9.7768992e-01]
 [1.9303973e-01 1.7818052e-01 6.2877971e-01]
 [5.8375727e-02 1.0351211e-01 8.3811218e-01]
 [3.3485282e-03 5.2936333e-03 9.9135780e-01]
 [2.1252513e-02 3.4929726e-02 9.4381779e-01]
 [4.6547498e-03 4.0444736e-03 9.9130076e-01]
 [8.4095538e-02 1.3293470e-01 7.8296977e-01]
 [6.8616783e-03 1.5771488e-02 9.7736686e-01]
 [2.9672135e-03 5.1083490e-03 9.9192446e-01]
 [5.7883211e-03 1.1918653e-02 9.8229307e-01]
 [2.7834701e-03 7.0321797e-03 9.9018431e-01]
 [3.2245289e-03 6.9324719e-03 9.8984307e-01]
 [1.1025379e-02 1.6933834e-02 9.7204077e-01]]

and for the same input, I get this from TensorRT:

[[9.33836460e-01 6.61635026e-02 7.27533485e-12]
 [9.39429879e-01 6.05701208e-02 3.69911090e-12]
 [9.60956275e-01 3.90437581e-02 1.84191551e-12]
 [9.52795386e-01 4.72046025e-02 1.99532123e-12]
 [9.19843435e-01 8.01565349e-02 5.39801103e-12]
 [9.51802194e-01 4.81977351e-02 5.10215741e-12]
 [9.62418616e-01 3.75813469e-02 1.31316594e-12]
 [9.84232545e-01 1.57674570e-02 1.14268006e-14]
 [9.79626715e-01 2.03733463e-02 3.17116023e-14]
 [9.95743811e-01 4.25621541e-03 8.78552331e-14]
 [9.82334971e-01 1.76650658e-02 3.78184490e-13]
 [9.75318611e-01 2.46814489e-02 3.20312830e-12]
 [9.79469538e-01 2.05304530e-02 1.96992487e-12]
 [9.49763775e-01 5.02362810e-02 7.18239812e-13]
 [9.05427277e-01 9.45727080e-02 1.35735251e-11]
 [9.17646766e-01 8.23532864e-02 2.70730747e-12]
 [8.63423824e-01 1.36576220e-01 4.19122679e-12]
 [9.23897922e-01 7.61020854e-02 1.90187167e-12]
 [9.45324779e-01 5.46751693e-02 5.95223523e-13]]

My model has:

  1. Convolutional layers (using tf.layers.conv2d)
  2. leaky relu layers (using tf.maximum )
  3. batch normalization layers (using tf.layers.batch_normalization)
  4. conversion from NHWC to NCHW at the beginning and from NCHW back to NHWC at the end (using tf.transpose)
  5. flatten (using tf.reshape)
  6. dense layer (using tf.layers.dense)
  7. softmax (using tf.nn.softmax)

I hope it’s not because one or more of the ops I’m using is unsupported, as I’ve spent a lot of time getting to this stage! :(

Thanks…

Hi,

It’s recommended to start with our official TensorRT sample located at ‘/usr/local/lib/python2.7/dist-packages/tensorrt/examples/tf_to_trt/’.

Here are some suggestions for you:
1. Please remember to convert your input into np.float32.

input_img = input_img.astype(np.float32)

2. It’s recommended to use pure NCHW format.
It is tricky to handle a mixed NHWC/NCHW network.
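
For example, one up-front transpose keeps the whole pipeline in NCHW (a minimal sketch, assuming the batch is a NumPy array in NHWC order):

import numpy as np

# sketch: convert an NHWC batch to NCHW once, before it ever reaches the network
batch_nhwc = np.random.rand(19, 40, 256, 1).astype(np.float32)       # example NHWC batch
batch_nchw = np.ascontiguousarray(batch_nhwc.transpose(0, 3, 1, 2))  # now (19, 1, 40, 256)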

Thanks.

Hi AstaLLL,

Thanks for your reply. I tried your suggestions:

  1. I converted my model to pure NCHW so no tf.transpose anymore.
  2. I already convert my data to np.float32. So, no change in this regard!
  3. I looked at the samples provided in the tf_to_trt folder and can’t spot anything different.

But I still have the same problem: the results from the TensorFlow model are different from the results I get from the TensorRT engine. Any suggestions?

Here is my code:

import numpy as np
import pycuda.autoinit  # creates the CUDA context needed by the pycuda calls below
import pycuda.driver as cuda
import tensorrt as trt

def inference_TRT(batch_size, data):
    # inference with TensorRT (config and config_path are defined elsewhere in my code)
    data = data.astype(np.float32)
    output = np.empty(batch_size * (config['num_classes'] + 1), dtype = np.float32)

    G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)
    engine = trt.utils.load_engine(G_LOGGER, config_path['model_trt_engine'])
    runtime = trt.infer.create_infer_runtime(G_LOGGER)
    context = engine.create_execution_context()

    #allocate device memory
    d_input = cuda.mem_alloc(batch_size * data[0].size * data[0].dtype.itemsize)
    d_output = cuda.mem_alloc(batch_size * output.size * output.dtype.itemsize)

    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()

    cuda.memcpy_htod_async(d_input, data, stream) # transfer input data to device
    context.enqueue(batch_size, bindings, stream.handle, None) # execute model
    cuda.memcpy_dtoh_async(output, d_output, stream) # transfer predictions back
    stream.synchronize() # synchronize the stream before reading the output
    
    context.destroy()
    engine.destroy()
    runtime.destroy()
    
    return output
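
For reference, this is roughly how I call it and map the flat output buffer back to per-class scores (a sketch; `batch` stands in for my input array):

# hypothetical usage: the engine returns one flat float32 buffer, which I reshape
# into (batch_size, num_classes + 1) rows of class scores
preds = inference_TRT(19, batch)
preds = preds.reshape(19, config['num_classes'] + 1)
print(preds[0])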

Appreciate your help.

Can someone reply please?! I’m stuck and don’t know what to do anymore!

If it helps, here is the way I make my tensorRT engine:

import tensorflow as tf
import tensorrt as trt
import uff
from tensorrt.parsers import uffparser

def build_tensorrt_model(batch_size):
    # build the TensorRT engine (config, config_path and D are defined elsewhere in my code)
    tf.reset_default_graph()
    x = tf.placeholder(dtype = tf.float32, shape = [None,
                                                    config['num_channels'],
                                                    config['x_height'],
                                                    config['x_width']])
    _, _, D_real_prob = D(x) # discriminator model

    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, config_path['model_discriminator'])

        # convert the tf model to a uff model
        graphdef = tf.get_default_graph().as_graph_def()
        frozen_graph = tf.graph_util.convert_variables_to_constants(sess,
                                                                    graphdef,
                                                                    ['Discriminator/Softmax'])
        tf_model = tf.graph_util.remove_training_nodes(frozen_graph)
        uff_model = uff.from_tensorflow(tf_model, ['Discriminator/Softmax'])

    # import a uff model into tensorrt and create an engine
    G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)

    parser = uffparser.create_uff_parser()
    parser.register_input('Placeholder', (config['num_channels'],
                                          config['x_height'],
                                          config['x_width']), 0)
    parser.register_output('Discriminator/Softmax')
    engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                         batch_size, 1 << 20, 
                                         trt.infer.DataType.FLOAT)
        
    runtime = trt.infer.create_infer_runtime(G_LOGGER)
    context = engine.create_execution_context()
    trt.utils.write_engine_to_file(config_path['model_trt_engine'], engine.serialize())

In my case, batch_size = 19 and the number of classes is 3.

Thanks

Hi,

Could you share the dtype of data?
Is it a Python list, by any chance?

We want to check whether this issue comes from a different arrangement of the input data.
Could you also set batch size = 1 and run the test again?

Thanks.

Hi AstaLLL,

Thanks for your reply. data is a numpy array of shape [N = 19, C = 1, H = 40, W = 256] and it’s of type np.float32. I tried with batch_size = 1, but no luck! :(
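
Just in case the data arrangement is the problem, here is the kind of sanity check I can add before the host-to-device copy (a sketch; `data` is the batch described above):

# pycuda copies raw bytes, so the in-memory layout must really be contiguous NCHW
data = np.ascontiguousarray(data.astype(np.float32))
assert data.shape == (19, 1, 40, 256) and data.flags['C_CONTIGUOUS']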

Thanks

I actually spent last weekend with similar issues. We had a network built with tf.layers that when converted to TensorRT gave completely wrong inference results. We were extremely frustrated since we were following the exact same steps to build the engine and run inference as is seen in the examples.

What got it to work eventually was converting the tf.layers portions of our code to use tf.nn instead. We don’t do any special padding or anything like that, but for each layer we now manually call the individual ops ourselves, such as the ReLU and the bias add.

My guess is that the UFF conversion makes certain assumptions about what steps are performed in each layer and those assumptions do not hold when tf.layers is used. As a result, the model will still be converted, but will no longer be equivalent.

Hi dwd_pete,

Thanks for the valuable information; I’d never have figured that out. Much appreciated. I’ll try it and post my findings.

Thanks

OK, I replaced my tf.layers.conv2d and tf.layers.dense with:

def weight(name, shape):
    init = tf.initializers.variance_scaling(config['kernel_init_scale'])
    W = tf.get_variable(name, shape = shape, dtype = tf.float32, initializer = init)
    return W
def bias(name, shape):
    b = tf.constant(0.0, shape = shape, dtype = tf.float32)
    return tf.Variable(b, name = name, dtype = tf.float32)
def conv2d_layer(inputs, filters, kernel_size, strides, 
                 name, padding = 'SAME', data_format='NCHW'):
    # convolutional layer cause TensorRT doesn't work with tf.layers.conv2d! 
    W_conv = weight('W_' + name,
                    [kernel_size[0], kernel_size[1], inputs.get_shape()[1],
                    filters])
    b_conv = bias('b_' + name, [filters])
    tmp = tf.nn.conv2d(inputs, W_conv, 
                        strides = [1, 1, strides[0], strides[1]], 
                        padding = padding, data_format = data_format)
    tmp = tf.transpose(tmp, [0, 2, 3, 1]) # NCHW to NHWC 
    conv = tmp + b_conv #  Don't know how to do this without the above transpose!!!!
    conv = tf.transpose(conv, [0, 3, 1, 2]) # NHWC to NCHW (have to go back to NCHW)
    return conv
def dense_layer(inputs, units, name):
    # dense layer cause TensorRT doesn't work with tf.layers.dense!
    W_dense = weight('W_' + name, [inputs.get_shape()[1], units])
    b_dense = bias('b_' + name, [units])
    dense = tf.matmul(inputs, W_dense) + b_dense
    return dense

and I still have the same problem.

The only tf.layers op that I kept is tf.layers.batch_normalization, as tf.nn.batch_normalization is a bit more involved, so I tried to avoid it!
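
(Side note: if tf.nn.bias_add does what I expect with data_format = 'NCHW', the two transposes around the bias add above shouldn’t be needed. A sketch I haven’t verified yet:)

def conv2d_layer_nchw(inputs, filters, kernel_size, strides, name,
                      padding = 'SAME', data_format = 'NCHW'):
    # assumption: tf.nn.bias_add broadcasts the bias over the channel axis in NCHW,
    # which would make the NCHW -> NHWC -> NCHW round trip above unnecessary
    W_conv = weight('W_' + name,
                    [kernel_size[0], kernel_size[1], inputs.get_shape()[1], filters])
    b_conv = bias('b_' + name, [filters])
    conv = tf.nn.conv2d(inputs, W_conv,
                        strides = [1, 1, strides[0], strides[1]],
                        padding = padding, data_format = data_format)
    return tf.nn.bias_add(conv, b_conv, data_format = data_format)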

Now, my question for NVIDIA:

I train a semi-supervised GAN and then save only the GAN’s discriminator, as I don’t need the generator anymore. This is how I save the discriminator:

with tf.Session() as sess:
   variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope = 'Discriminator')
   saver = tf.train.Saver(variables, max_to_keep = 1)
   saver.save(sess, config_path['model_discriminator'])

Do you think the way I’m saving the discriminator is causing the problem? That is, some parts of the graph are not saved this way, which doesn’t matter when I restore the discriminator but does affect the way the TensorRT engine is built?

That’s the only thing that I can think of now. Looking forward to your feedback and comments.
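
One thing I plan to check is what actually ends up in that checkpoint, e.g. whether the batch-norm moving means/variances under 'Discriminator' are saved (a quick sketch, assuming tf.train.list_variables is available in my TF version):

import tensorflow as tf

# list every variable stored in the discriminator checkpoint, to confirm
# e.g. moving_mean / moving_variance of the batch-norm layers are included
for name, shape in tf.train.list_variables(config_path['model_discriminator']):
    print(name, shape)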

If it helps, my discriminator has some layers of:

# layer2
conv2 = tf.layers.conv2d(dropout1, 64, [2, 5],
                         strides = [2, 2],
                         padding = 'same',
                         data_format = 'channels_first',
                         kernel_initializer = tf.initializers.variance_scaling(config['kernel_init_scale']))
batch_norm2 = tf.layers.batch_normalization(conv2, training = is_training, axis = 1)
lrelu2 = tf.maximum(0.2 * batch_norm2, batch_norm2)

and at the end:

# layer 6
flatten_length = lrelu5.get_shape().as_list()[1] * \
                 lrelu5.get_shape().as_list()[2] * lrelu5.get_shape().as_list()[3]
flatten6 = tf.reshape(lrelu5, (-1, flatten_length)) # used for "Feature Matching" 
fc6 = tf.layers.dense(flatten6, (config['num_classes'] + 1),
                                   kernel_initializer = tf.initializers.variance_scaling(config['kernel_init_scale']))
output = tf.nn.softmax(fc6)

I also have a couple of tf.layers.dropout layers, which I remove from the model before I save the discriminator.

Thanks

I tried to compare the results of each layer. I started with lrelu1 in:

conv1 = tf.layers.conv2d(x, 32, [3, 7],
                                 strides = [2, 2],
                                 padding = 'same', 
                                 data_format = 'channels_first')
lrelu1 = tf.maximum(0.2 * conv1, conv1, name = 'out_nejla') #leaky relu

and the outputs are (I just copied the beginning of each):

From Tensorflow Model:

[[[[-8.64069560e-04 -7.18663388e-04 -7.68368773e-04 ...  6.53133402e-03
     4.71562613e-04  1.38199981e-03]
   [-7.67161546e-04 -5.95747901e-04 -7.78576476e-04 ...  9.97669809e-03
     2.30994588e-03  9.81694087e-04]
   [-5.92876168e-04 -5.93446835e-04 -8.23813141e-04 ...  1.12019163e-02
     2.44506169e-03  1.06534036e-03]...

From TensorRT Inference Engine:

[[[[-8.93034681e-04 -9.40369151e-04 -1.00716669e-03 ... -1.39611610e-03
    -7.18663388e-04 -7.68368773e-04]
   [-2.10812548e-04 -1.67399790e-04 -4.11662331e-04 ... -7.78576476e-04
     8.37342814e-05 -3.67132539e-04]
   [ 2.53846729e-03  7.23261712e-03  2.59393733e-03 ...  8.97953752e-04
     1.77786220e-04  6.90183928e-03]...

Though they look different at first glance, looking closely shows that some values are identical but shifted: the TensorFlow output starts with -8.64069560e-04, -7.18663388e-04, -7.68368773e-04, and the last two of those values reappear at the end of the first row of the TensorRT output.

When I get the output from the inference engine, I reshape it using:

preds.reshape(batch_size, 32, 20, 128)
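
and then compare it element-wise against the TensorFlow activation, roughly like this (a sketch; `tf_out` stands for the NCHW lrelu1 output from the session run):

# hypothetical comparison: both outputs in NCHW so they can be diffed element-wise
trt_out = preds.reshape(batch_size, 32, 20, 128)
diff = np.abs(trt_out - tf_out)
print('max abs diff:', diff.max())
print('allclose:', np.allclose(trt_out, tf_out, atol = 1e-5))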

Hi NVIDIA,

I also tried the conv2d_layer built with tf.nn.conv2d that I posted in #9, and I see a similar situation to the one in #10.

I’m not sure, but I think TensorRT has a bug in the way the filter size is handled. The majority of architectures use the same filter_height and filter_width, but as you can see in my posts, I’m using different values for filter_height and filter_width. I think that is what changes the ordering of the outputs in the first layer; the error then propagates through the layers, and by the end the result becomes something completely different.

I’m looking forward to NVIDIA’s feedback and will park this at this stage.

Thanks.

Hi,

Leaky ReLU is not supported by TensorRT.
You can find detailed information in our documentation:
Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

Thanks.

Hi AstaLLL,

Thanks for your reply. Your documentation says TensorRT supports ElementWise layers, and maximum is on that list; that is exactly how I’ve implemented leaky ReLU. I’m not sure why you say leaky ReLU is not supported. Can you explain why?

Thanks

Hi AstaLLL,

I have some good news: it’s working now, and I get the same results from both TensorRT and TensorFlow.

What I think NVIDIA needs to know:

  1. There is a bug in TensorRT. When filter_width and filter_height are different, it messes things up and the output of TensorRT becomes different from the output of TensorFlow (please have a look at my comment #10). I used filter_width = filter_height and the problem was solved (tested and verified). I hope you fix this bug in your next release.

  2. TensorRT can handle leaky ReLU. I implemented leaky ReLU with tf.maximum and the result from TensorRT is the same as what I get from TensorFlow.

In general, thank you NVIDIA for your amazing work. There is a lot of value in TensorRT, and I’m looking forward to more amazing stuff from you guys… :)

Hi,

Sorry for the unclear comment in #12.
We meant that the standard leaky ReLU op is not supported by TensorRT.
Good to know that there is an alternative way to implement leaky ReLU with tf.maximum.

For question 1:
We have tested the conv2d op with filter_width ≠ filter_height but failed to reproduce this issue.
The TensorFlow and TensorRT results of our sample are identical.

Could you check the sample below and share how to reproduce the issue you mentioned?

from tensorrt.parsers import uffparser
import tensorflow as tf
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context used by the pycuda calls below
import numpy as np
import uff


MAX_WORKSPACE = 1 << 20
MAX_BATCHSIZE = 1
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)

inputs = tf.placeholder(dtype=tf.float32, shape=[1,10,10,1])
output = tf.layers.conv2d(inputs, 1, [3,5])
output = tf.nn.sigmoid(output, name='out')

data = np.expand_dims(np.expand_dims(np.random.rand(10,10), axis=0), axis=3)
data = data.astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf_result = sess.run(output,feed_dict={inputs:data})

    graphdef = tf.get_default_graph().as_graph_def()
    frozen_graph = tf.graph_util.convert_variables_to_constants(sess, graphdef, ['out'])
    tf_model = tf.graph_util.remove_training_nodes(frozen_graph)

uff_model = uff.from_tensorflow(tf_model, ['out'])

parser = uffparser.create_uff_parser()
parser.register_input("Placeholder", (1,10,10), 0)
parser.register_output("out")

engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser, MAX_BATCHSIZE, MAX_WORKSPACE)
parser.destroy()

runtime = trt.infer.create_infer_runtime(G_LOGGER)
context = engine.create_execution_context()

input_index = engine.get_binding_index('Placeholder')
output_index = engine.get_binding_index('out')

input_dim = engine.get_binding_dimensions(input_index).to_DimsCHW()
output_dim = engine.get_binding_dimensions(output_index).to_DimsCHW()

insize = input_dim.C() * input_dim.H() * input_dim.W()
outsize = output_dim.C() * output_dim.H() * output_dim.W()

trt_result = cuda.pagelocked_empty(outsize, dtype=np.float32)

d_input = cuda.mem_alloc(insize * data.dtype.itemsize)
d_output = cuda.mem_alloc(outsize * trt_result.dtype.itemsize)

bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()

cuda.memcpy_htod_async(d_input, data, stream)
context.enqueue(1, bindings, stream.handle, None)
cuda.memcpy_dtoh_async(trt_result, d_output, stream)
stream.synchronize()  # make sure the copy has finished before reading trt_result

print('TensorFlow:')
print(tf_result)
print('\nTensorRT:')
print(trt_result)

Thanks.

Hi AstaLLL,

Thanks for your reply. I tried your code and it produced the same results. Then I changed the code to look more like mine, and it produced different results!

The differences between yours and mine are:

  1. In my case, x_width ≠ x_height. In your case, x_width = x_height
  2. My model is channel_first. Your model is channel_last.
  3. I specified strides too.
  4. I specified padding as well.

Here is my code:

from tensorrt.parsers import uffparser
import tensorflow as tf
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context used by the pycuda calls below
import numpy as np
import uff

MAX_WORKSPACE = 1 << 20
MAX_BATCHSIZE = 1
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)

inputs = tf.placeholder(dtype=tf.float32, shape=[1, 1, 40, 256])
#output = tf.layers.conv2d(inputs, 1, [3, 5], )
output = tf.layers.conv2d(inputs, 64, [3, 7],
                          strides = [2, 2],
                          padding = 'same',
                          data_format = 'channels_first')
output = tf.nn.sigmoid(output, name='out')

data = np.expand_dims(np.expand_dims(np.random.rand(40,256), axis = 0), axis = 3)
data = np.transpose(data, [0, 3, 1, 2])
data = data.astype(np.float32)

with tf.Session() as sess:
   sess.run(tf.global_variables_initializer())
   tf_result = sess.run(output,feed_dict={inputs:data})

   graphdef = tf.get_default_graph().as_graph_def()
   frozen_graph = tf.graph_util.convert_variables_to_constants(sess, graphdef, ['out'])
   tf_model = tf.graph_util.remove_training_nodes(frozen_graph)

   uff_model = uff.from_tensorflow(tf_model, ['out'])

   parser = uffparser.create_uff_parser()
   parser.register_input("Placeholder", (1,40,256), 0)
   parser.register_output("out")

   engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser, MAX_BATCHSIZE, MAX_WORKSPACE)
   parser.destroy()

   runtime = trt.infer.create_infer_runtime(G_LOGGER)
   context = engine.create_execution_context()

   input_index = engine.get_binding_index('Placeholder')
   output_index = engine.get_binding_index('out')

   input_dim = engine.get_binding_dimensions(input_index).to_DimsCHW()
   output_dim = engine.get_binding_dimensions(output_index).to_DimsCHW()

   insize = input_dim.C() * input_dim.H() * input_dim.W()
   outsize = output_dim.C() * output_dim.H() * output_dim.W()

   trt_result = cuda.pagelocked_empty(outsize, dtype=np.float32)

   d_input = cuda.mem_alloc(insize * data.dtype.itemsize)
   d_output = cuda.mem_alloc(outsize * trt_result.dtype.itemsize)

   bindings = [int(d_input), int(d_output)]
   stream = cuda.Stream()

   cuda.memcpy_htod_async(d_input, data, stream)
   context.enqueue(1, bindings, stream.handle, None)
   cuda.memcpy_dtoh_async(trt_result, d_output, stream)
   stream.synchronize()  # make sure the copy has finished before reading trt_result

   print('TensorFlow:')
   print(tf_result.flatten()[:10])
   print('\nTensorRT:')
   print(trt_result[:10])

Here is the output that I get:

TensorFlow:
[0.517358   0.5182735  0.52483785 0.52156085 0.4992595  0.5175361
 0.5125967  0.51287144 0.52240974 0.5365737 ]

TensorRT:
[0.48985785 0.49997658 0.5048609  0.4967685  0.49668005 0.50184053
 0.5059915  0.48398086 0.5130922  0.50384027]

When I change the kernel_size (I mistakenly called kernel_height/width “filter_height/width” in my previous posts; sorry about that) to [3, 3], I get the same results:

TensorFlow:
[0.47558555 0.47334933 0.46524027 0.44461906 0.46312934 0.4663003
 0.4677398  0.47723734 0.45848316 0.43497655]

TensorRT:
[0.47558555 0.47334933 0.46524027 0.44461906 0.46312934 0.4663003
 0.4677398  0.47723734 0.45848316 0.43497655]

There is also another bug in TensorRT: when, at some layer of the model, the conv2d output height or width becomes odd rather than even, the final dense layer makes TensorRT crash at build time when I run trt.utils.uff_to_trt_engine. I’m still wrestling with this case and will post more details once I sort it out, but I thought it’s good to let you know.
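
For context, the odd sizes show up naturally with 'same' padding and stride 2; a quick sketch of how my input height shrinks layer by layer (assumption: output size is ceil(input / stride)):

h = 40
for layer in range(5):
    h = -(-h // 2)                              # ceil(h / 2) for 'same' padding, stride 2
    print('height after layer', layer + 1, '=', h)  # 20, 10, 5, 3, 2 -- odd sizes from layer 3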

FYI, I posted the details of the last bug I reported in #16 here:
https://devtalk.nvidia.com/default/topic/1031380/jetson-tx2/tensorrt-model-cannot-be-built-when-at-some-layers-of-the-model-the-output-of-conv2ds-height-and-or-width-is-odd/ as I think it’s not related to this thread. :-)

You are right.

We have passed the feedback to our internal team (for tf.layers.conv2d).
Will update you with more information later.

Hi,

Sorry for keeping you waiting.

This issue is fixed in TensorFlow-1.8 with TensorRT-4.0.
Please install our latest TensorRT version for the fix.

Thanks.

Thanks AstaLLL for the good news. :)