Transpose operation is not performed when a reshape operation follows

Just FYI, I have submitted the same issue via the bug reporting system (#2444148).

[Platform details]
Linux distro and version: Ubuntu 16.04.5 LTS
GPU type: GeForce GTX 1080 Ti
nvidia driver version: 384.111
CUDA version: 9.0.176
CUDNN version: 7.3.1.20
Python version [if using python]: 3.5.2
Tensorflow version: 1.11.0
TensorRT version: 5.0.2-1+cuda9.0 (Debian packages from nv-tensorrt-repo-ubuntu1604-cuda9.0-trt5.0.2.6-ga-20181009_1-1_amd64)

[Python code to reproduce]

#!/usr/bin/env python3

import numpy as np
import tensorflow as tf
import tensorrt as trt
import uff

try:
    import common
except ImportError:
    print('Need to import /usr/src/tensorrt/samples/python/common.py')
    print('e.g. export PYTHONPATH=$PYTHONPATH:/usr/src/tensorrt/samples/python/')
    raise

# https://devtalk.nvidia.com/default/topic/1038494/tensorrt/logicerror-explicit_context_dependent-failed-invalid-device-context-no-currently-active-context-/post/5284290/#5284290
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.Severity.INFO)

graph = tf.Graph()
with graph.as_default():
    # [1, 4 * 10, 5, 1]
    input = tf.placeholder(tf.float32, [1, 4 * 10, 5, 1], name='input')

    # [1, 4 * 10, 5, 1] -> [1, 4, 10 * 5, 1]
    reshaped_0 = tf.reshape(input, [1, 4, -1, 1])

    # [1, 4, 10 * 5, 1] -> [1, 10 * 5, 4, 1]
    transposed_0 = tf.transpose(reshaped_0, perm=[0, 2, 1, 3])

    # [1, 10 * 5, 4, 1]  -> [1, 10 * 5 * 4, 1, 1]
    reshaped_1 = tf.reshape(transposed_0, [1, -1, 1, 1], name='output')

UFF_PATH = '/tmp/debug_trt_transpose.uff'

serialized_uff = uff.from_tensorflow(output_filename=UFF_PATH,
                                     output_nodes=['output'],
                                     quiet=False,
                                     text=False,
                                     graphdef=graph.as_graph_def())

with trt.Builder(TRT_LOGGER) as builder:
    with builder.create_network() as network:
        uff_parser = trt.UffParser()
        uff_parser.register_input('input', [4 * 10, 5, 1])
        uff_parser.register_output('output')
        uff_parser.parse(UFF_PATH, network)

        with builder.build_cuda_engine(network) as engine:
            inputs, outputs, bindings, stream = common.allocate_buffers(engine)

            with engine.create_execution_context() as context:
                input_np = np.array(range(4 * 10 * 5))

                np.copyto(inputs[0].host, input_np)
                results = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

                actual = results[0].reshape((1, -1, 1, 1))
                expected = input_np.reshape((1, 4, -1, 1)).transpose([0, 2, 1, 3]).reshape((1, -1, 1, 1))

                np.testing.assert_allclose(actual, expected)
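
For reference, the intended semantics of the reshape → transpose → reshape chain can be checked with plain NumPy; this is the same computation as the `expected` array in the script above:

# Plain-NumPy sanity check of the intended reshape -> transpose -> reshape order
import numpy as np

x = np.arange(4 * 10 * 5)   # 0, 1, 2, ..., 199
y = x.reshape((1, 4, -1, 1)).transpose([0, 2, 1, 3]).reshape(-1)
print(y[:8])                # -> [0 50 100 150 1 51 101 151]

So the correct output should begin 0, 50, 100, 150, 1, …, not 0, 1, 2, …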

[Log output]

UFF Version 0.5.5
=== Automatically deduced input nodes ===
[name: "input"
op: "Placeholder"
attr {
  key: "dtype"
  value {
    type: DT_FLOAT
  }
}
attr {
  key: "shape"
  value {
    shape {
      dim {
        size: 1
      }
      dim {
        size: 40
      }
      dim {
        size: 5
      }
      dim {
        size: 1
      }
    }
  }
}
]
=========================================

Using output node output
Converting to UFF graph
No. nodes: 7
UFF Output written to /tmp/debug_trt_transpose.uff
[TensorRT] INFO: UFFParser: parsing input
[TensorRT] INFO: UFFParser: parsing Reshape/shape
[TensorRT] INFO: UFFParser: parsing Reshape
[TensorRT] INFO: UFFParser: parsing transpose
[TensorRT] INFO: UFFParser: parsing output/shape
[TensorRT] INFO: UFFParser: parsing output
[TensorRT] INFO: UFFParser: parsing MarkOutput_0
[TensorRT] INFO: Original: 6 layers
[TensorRT] INFO: After dead-layer removal: 4 layers
[TensorRT] INFO: After scale fusion: 4 layers
[TensorRT] INFO: Fusing Reshape with transpose
[TensorRT] INFO: Fusing Reshape + transpose with (Unnamed Layer* 4) [Shuffle]
[TensorRT] INFO: Fusing Reshape + transpose + (Unnamed Layer* 4) [Shuffle] with output
[TensorRT] INFO: After vertical fusions: 1 layers
[TensorRT] INFO: After swap: 1 layers
[TensorRT] INFO: After final dead-layer removal: 1 layers
[TensorRT] INFO: After tensor merging: 1 layers
[TensorRT] INFO: After concat removal: 1 layers
[TensorRT] INFO: Graph construction and optimization completed in 0.000125836 seconds.
[TensorRT] INFO: 
[TensorRT] INFO: --------------- Timing Reshape + transpose + (Unnamed Layer* 4) [Shuffle] + output(19)
[TensorRT] INFO: Tactic 0 is the only option, timing skipped
[TensorRT] INFO: Formats and tactics selection completed in 0.677852 seconds.
[TensorRT] INFO: After reformat layers: 1 layers
[TensorRT] INFO: Block size 0
[TensorRT] INFO: Total Activation Memory: 0
[TensorRT] INFO: Data initialization and engine generation completed in 0.00462642 seconds.
Traceback (most recent call last):
  File "/debug_trt_transpose.py", line 62, in <module>
    np.testing.assert_allclose(actual, expected)
  File "/virtualenvs/lib/python3.5/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/virtualenvs/lib/python3.5/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

(mismatch 99.0%)
 x: array([[[[  0.]],

        [[  1.]],...
 y: array([[[[  0]],

        [[ 50]],...
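The engine output above is simply the input in its original order (0, 1, 2, …), so the transpose appears to have been dropped, presumably during the "Fusing Reshape with transpose" step in the log.

As a possible workaround, here is a sketch (untested against this exact build, shown only for illustration) that builds the same reshape → transpose → reshape chain directly with the TensorRT network API, using two IShuffleLayer instances instead of going through the UFF parser:

#!/usr/bin/env python3
# Sketch: same reshape -> transpose -> reshape chain built with the TensorRT
# network API (two IShuffleLayer instances), bypassing the UFF parser.

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import common           # from /usr/src/tensorrt/samples/python/

TRT_LOGGER = trt.Logger(trt.Logger.Severity.INFO)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # Implicit batch dimension: the input is registered as (40, 5, 1),
    # matching register_input('input', [4 * 10, 5, 1]) in the UFF version.
    input_tensor = network.add_input('input', trt.float32, (4 * 10, 5, 1))

    # (40, 5, 1) -> reshape to (4, 50, 1), then transpose to (50, 4, 1)
    shuffle_0 = network.add_shuffle(input_tensor)
    shuffle_0.reshape_dims = (4, 10 * 5, 1)
    shuffle_0.second_transpose = trt.Permutation([1, 0, 2])

    # (50, 4, 1) -> reshape to (200, 1, 1)
    shuffle_1 = network.add_shuffle(shuffle_0.get_output(0))
    shuffle_1.reshape_dims = (10 * 5 * 4, 1, 1)
    shuffle_1.get_output(0).name = 'output'

    network.mark_output(shuffle_1.get_output(0))

    with builder.build_cuda_engine(network) as engine:
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        with engine.create_execution_context() as context:
            input_np = np.arange(4 * 10 * 5, dtype=np.float32)
            np.copyto(inputs[0].host, input_np)
            results = common.do_inference(context, bindings=bindings,
                                          inputs=inputs, outputs=outputs,
                                          stream=stream)
            print(results[0][:8])  # should start 0, 50, 100, 150, 1, ...

If the explicit shuffle layers produce the correct ordering, that would further point at the UFF parser's Reshape + transpose fusion rather than the shuffle implementation itself.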

Hello,

We’ve reproduced this locally and are triaging it. We will keep you updated.

Hi,

I’m just checking if there is any update on this issue.

Hello,

This seems to be a known issue that our engineers are working on. We will keep you updated.