Inference of a model using tensorflow/onnxruntime and TensorRT gives different results

Hi. I have a simple model which I trained using TensorFlow. After that I converted it to ONNX and tried to run inference on my Jetson TX2 with JetPack 4.4.0 using TensorRT, but the results are different.

This is how I run inference with onnxruntime (the model has input [-1, 128, 64, 3] and output [-1, 128]):

import onnxruntime as rt
import cv2 as cv
import numpy as np


sess = rt.InferenceSession("model_tf_float_opset10.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

im = cv.imread('cut10735.png')
im = cv.resize(im, (64, 128))

val_x = []
val_x.append(np.asarray(im).astype(np.float32))

print(output_name, input_name)
try:
    pred = sess.run([output_name], {input_name: val_x})[0]
    print(pred)
except Exception as e:
    print("Unexpected type")
    print("{0}: {1}".format(type(e), e))

This is how I run inference with TensorRT:

IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
  
nvonnxparser::IParser* parser =
        nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
auto config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);

auto profile = builder->createOptimizationProfile();
profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

config->addOptimizationProfile(profile);

ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
int inputIndex = engine->getBindingIndex("images:0");
int outputIndex = engine->getBindingIndex("features:0");


cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
cv::cuda::GpuMat output(128, 1, CV_32F);
cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
gpu_dst.upload(img);
Mat result(128, 1, CV_32F);

void* buffers[2];
buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

IExecutionContext *context = engine->createExecutionContext();
context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
context->executeV2(buffers);
  
output.download(result);
std::cout << result << std::endl;

What can be wrong here?
Here you can find my model and test image:
https://drive.google.com/drive/folders/1xEYcoQwOew-a74c6jpxDJtKu49NNNTgW?usp=sharing

Hi,

May I know which JetPack 4.4 version you are using? The DP or GA release?

There are some TensorRT issues fixed in our TensorRT GA release.
If you are using the DP version, it's recommended to give the latest JetPack a try first.

Thanks.

Hi. Thanks for your answer. I use the latest (GA) release of JetPack.

Thanks for your feedback.

We are going to reproduce this in our environment.
We will update you with more information once we have made some progress.

Thanks.

Hi,

Sorry for the late update.

It looks like you are feeding NHWC-format input into TensorRT.
However, TensorRT uses the NCHW format if no special data format is specified.

Would you mind converting the data into NCHW format and trying it again?
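
For example, a minimal sketch of one possible conversion with OpenCV's dnn module (assuming that module is available in your build) would be:

cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
// blobFromImage converts the 8-bit HWC image into a 1x3x128x64 NCHW float blob
cv::Mat blob = cv::dnn::blobFromImage(img);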

Thanks.

Hi. I have just returned to this task. I converted the image from HWC to CHW, but I still get different results. Have you tried executing my model in your environment? I still can't figure out what I am doing wrong. It looks like the problem is in TensorRT… Here is the fixed code:

void hwc_to_chw(Mat& src, Mat& dst)
{
    const int src_h = src.rows;
    const int src_w = src.cols;
    const int src_c = src.channels();

    cv::Mat hw_c = src.reshape(1, src_h * src_w);
    const std::array<int, 3> dims = {src_c, src_h, src_w};
    dst.create(3, &dims[0], CV_MAKETYPE(src.depth(), 1));
    dst = dst.reshape(1, {src_c, src_h, src_w});
    cv::transpose(hw_c, dst);
}

IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
  
nvonnxparser::IParser* parser =
        nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
auto config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);

auto profile = builder->createOptimizationProfile();
profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

config->addOptimizationProfile(profile);

ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
int inputIndex = engine->getBindingIndex("images:0");
int outputIndex = engine->getBindingIndex("features:0");


cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
Mat chw;
hwc_to_chw(img, chw);
cv::cuda::GpuMat output(128, 1, CV_32F);
cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
gpu_dst.upload(chw);
Mat result(128, 1, CV_32F);

void* buffers[2];
buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

IExecutionContext *context = engine->createExecutionContext();
context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
context->executeV2(buffers);
output.download(result);
std::cout << result << std::endl;

Hi,

Is this issue fixed by the source you shared above?
Or is the output still different?

Thanks.

Hi. No, I just added the HWC to CHW conversion to my code and ran inference with the CHW image format, but the output is still different…

Thanks for your confirmation.
We are checking this issue internally and will update here when we have any news.

Hi,

Could you share a complete C++ source with us so our testing environments are aligned?
Also, we cannot download your model because permission is denied.
Could you help to enable it?

Thanks.

Hi. Here is the complete example for C++:

#include <QDebug>
#include "NvOnnxParser.h"
#include "NvInfer.h"
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <iostream>

using namespace nvinfer1;
using namespace std;
using namespace cv;
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            qDebug() << msg;
    }
} gLogger;

void hwc_to_chw(Mat& src, Mat& dst)
{
    const int src_h = src.rows;
    const int src_w = src.cols;
    const int src_c = src.channels();

    cv::Mat hw_c = src.reshape(1, src_h * src_w);
    const std::array<int, 3> dims = {src_c, src_h, src_w};
    dst.create(3, &dims[0], CV_MAKETYPE(src.depth(), 1));
    dst = dst.reshape(1, {src_c, src_h, src_w});
    cv::transpose(hw_c, dst);
}

int main(int argc, char *argv[])
{
    IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    nvonnxparser::IParser* parser =
            nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);

    auto profile = builder->createOptimizationProfile();
    profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
    profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
    profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

    config->addOptimizationProfile(profile);

    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    int inputIndex = engine->getBindingIndex("images:0");
    int outputIndex = engine->getBindingIndex("features:0");


    cv::Mat img = cv::imread("cut10735.png");
    cv::resize(img, img, cv::Size(64, 128));
    img.convertTo(img, CV_32FC3);
    Mat chw;
    hwc_to_chw(img, chw); // the result is also different without this conversion

    cv::cuda::GpuMat output(128, 1, CV_32F);
    cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
    gpu_dst.upload(chw);
    Mat result(128, 1, CV_32F);

    void* buffers[2];
    buffers[inputIndex] = gpu_dst.data;
    buffers[outputIndex] = output.data;

    IExecutionContext *context = engine->createExecutionContext();
    context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
    context->executeV2(buffers);
    output.download(result);
    std::cout << result << std::endl;
    return 0;
}

And result:

    [0.21741617;
     0.018074907;
     -0.11390054;
     -0.063543327;
     -0.025937777;
     0.035865564;
     -0.076091036;
     0.052894063;
     0.038509559;
     -0.086018324;
     -0.01179905;
     -0.097331896;
     -0.0059410459;
     0.12683025;
     -0.060086627;
     -0.10214382;
     0.11243917;
     0.031299014;
     0.038018111;
     0.10405114;
     0.12848331;
     0.087814629;
     0.0073070363;
     0.089286581;
     -0.0059706182;
     -0.010348215;
     0.048533771;
     0.016377596;
     -0.065473929;
     -0.071357735;
     -0.057700843;
     0.06062549;
     0.13593628;
     -0.089616664;
     0.12405562;
     0.032605886;
     0.14014122;
     0.066228069;
     0.17140938;
     -0.10419387;
     0.060142811;
     0.025047708;
     -0.04850373;
     -0.0051091611;
     -0.037088607;
     0.021987606;
     0.041916661;
     -0.074520893;
     -0.0069955303;
     -0.010275103;
     -0.093412094;
     0.020118032;
     0.013161802;
     -0.03882781;
     0.34909648;
     0.096535429;
     -0.13299593;
     -0.056748956;
     -0.05864013;
     0.18626797;
     -0.10133344;
     0.0037117086;
     0.056481041;
     -0.056401547;
     0.008236954;
     0.027182357;
     -0.086196691;
     -0.0098906467;
     -0.014210807;
     0.10143966;
     -0.018852601;
     -0.094578348;
     -0.082550853;
     -0.078829989;
     0.17038627;
     -0.049729101;
     -0.033981632;
     -0.071885966;
     -0.074963525;
     -0.070453018;
     0.077794887;
     -0.031517465;
     -0.057352651;
     0.24314322;
     -0.098119408;
     -0.11517338;
     0.047104623;
     0.12214255;
     0.042592634;
     -0.049099579;
     -0.0052699912;
     0.15962702;
     0.0061072111;
     -0.025500681;
     0.13695578;
     -0.016284261;
     0.02835446;
     0.0010018852;
     -0.029328998;
     -0.093557015;
     -0.088223211;
     -0.090557031;
     -0.040226765;
     -0.055105787;
     -0.063769467;
     -0.095169932;
     -0.019969737;
     -0.10545529;
     -0.073704861;
     -0.045880266;
     -0.057348069;
     -0.0088410918;
     0.032779064;
     -0.030334083;
     0.082117699;
     -0.054750722;
     0.12931874;
     -0.014893929;
     -0.050810333;
     0.078427881;
     -0.0071323351;
     -0.066322133;
     0.15791741;
     -0.09120097;
     -0.0047501461;
     -0.017844327;
     0.055174693;
     0.25382671]

And here is the complete example for Python:

import onnxruntime as rt
import cv2 as cv
import numpy as np

sess = rt.InferenceSession("model_tf_float_opset10.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

im = cv.imread('D:/test/cut10735.png')
im = cv.resize(im, (64, 128))

val_x = []
val_x.append(np.asarray(im).astype(np.float32))

try:
    pred = sess.run([output_name], {input_name: val_x})[0]
    print(pred)
except Exception as e:
    print("Unexpected type")
    print("{0}: {1}".format(type(e), e))

And result:

[[ 0.01409904  0.04889973  0.0068512  -0.03029526  0.03189908 -0.00803056
   0.08499128  0.17684014  0.04767798  0.03999923  0.10761195 -0.02139558
  -0.07083604  0.21018471 -0.0954253   0.06183483  0.00641072  0.06384496
   0.07886926  0.1793935   0.08014196  0.28557324 -0.00301654  0.0646928
  -0.03479696  0.05344842 -0.07249779 -0.0734601   0.01972523 -0.04917162
   0.17919122 -0.05199863  0.11306548 -0.00073111  0.0602023   0.17280908
   0.10716759  0.06920998  0.07518236 -0.1005758  -0.06912518 -0.05041961
  -0.08174316  0.09229721  0.00539807 -0.00653988 -0.01792016 -0.06297861
  -0.00426906 -0.03353138 -0.07037119 -0.05906641  0.09653544 -0.05743868
   0.0004447  -0.0169575  -0.04731843  0.00805656  0.06733166 -0.05133148
   0.09549342 -0.05889614  0.00104034 -0.00839987 -0.07568271  0.09445982
   0.06090422 -0.07335876 -0.03737445  0.21655427  0.24781345 -0.09691562
  -0.04168833  0.09711373  0.08745644  0.09150615  0.08400141  0.03679254
   0.08579402 -0.03720342  0.08905887 -0.02892478 -0.01852894  0.18236476
   0.16203596 -0.01284258  0.15244164 -0.00778199  0.00091876 -0.0007562
   0.03243485  0.0005487   0.13538565 -0.01859052 -0.02918059 -0.2117243
   0.14415158 -0.10589323 -0.02980787  0.06000031 -0.11880822  0.03605219
   0.06211431  0.15909992  0.07994679  0.09088707  0.10202844 -0.07346293
   0.03759274 -0.11829291 -0.08547547 -0.02942014  0.02702895  0.10723414
  -0.00601167 -0.08887953  0.00301785 -0.02657861  0.13011433  0.01591381
  -0.0574094  -0.0917868  -0.01761637 -0.06247395 -0.02496385  0.025118
  -0.05821754 -0.05967677]]

Hi,

Thanks for the example.
Could you also enable the download permission of the model shared in the original post?

Thanks.

Hi.
Yes, sorry, here is the actual link:
https://drive.google.com/drive/folders/1xEYcoQwOew-a74c6jpxDJtKu49NNNTgW

Hi,

Thanks.
We can reproduce this issue in our environment now.
We will update here when there is any progress.

Hi,

We have confirmed this is an application issue rather than a TensorRT error.

To align the input pre-processing, we rewrote your app with the Python interface.
After that, we get the same output as onnxruntime.

Please check if there is any difference in the OpenCV pre-processing.

trt.py.txt (1.7 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=model_tf_float_opset10.onnx --minShapes=128,64,3  --optShapes=128,64,3 --maxShapes=128,64,3 --dumpOutput --saveEngine=model.trt
$ python3 trt.py

Thanks.

Hi, AastaLLL. Thanks for your reply. I confirm that inference using TensorRT with Python works correctly. But I'm probably blind or stupid, because I still can't find any difference between the C++ code and the Python code, and I am still getting wrong results in C++.

So, what I did:

  1. I built the engine using the trtexec command from your post.
  2. I checked that it gives correct inference results in Python.
  3. I compared the pre-processing in Python and in C++:

C++:

cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
cv::cuda::GpuMat gpu_dst(128 * 64 * 3, 1, CV_32F);
gpu_dst.upload(img);

// just for check
gpu_dst.download(img);
std::cout << gpu_dst.cols * gpu_dst.rows * gpu_dst.channels() << "  " << img << std::endl;

So here we have the resized 64x128 float32 input, and cout gives (I provide just the beginning and end of the output):
24576 [91, 92, 53, 90, 91… 50, 63, 52, 49]

Let's compare that with Python:

im = cv2.imread('cut10735.png')
im = cv2.resize(im, (64, 128))
np.copyto(host_inputs[0], im.ravel())
print(np.shape(host_inputs[0]),host_inputs[0].dtype, host_inputs[0])

print output gives:

(24576,) float32 [91. 92. 53. ... 63. 52. 49.]

As I can see, here we also have a float32 input which is the same as in C++.

After that I need to do inference:
C++:

    cv::cuda::GpuMat output(1, 128, CV_32F);
    Mat result;

    void* buffers[2];
    int inputIndex = engine->getBindingIndex("images:0");
    int outputIndex = engine->getBindingIndex("features:0");
    buffers[inputIndex] = gpu_dst.data;
    buffers[outputIndex] = output.data;

    IExecutionContext *context = engine->createExecutionContext();
    context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
    context->execute(1, buffers);
    output.download(result);
    std::cout << result << std::endl;

python:

cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
context.set_binding_shape(0, [1, 128, 64, 3])
context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=1)
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
stream.synchronize()
print("execute times "+str(time.time()-start_time))
output = host_outputs[0].reshape(np.concatenate(([1],engine.get_binding_shape(1))))
print(output)

I don't see any difference here between the C++ and Python code… but the results are different…

Do you have any thoughts?

Hi,

Sorry for the late update.
Would you mind taking a more in-depth look at the two lines below?

buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

Please note that TensorRT is expected to use the NCHW format for its input and output.
But OpenCV may use the HWC format, which leads to the output difference.

To figure this out, you can print buffers[inputIndex] as a 1D array to ensure all the data are aligned.
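
For example, a minimal check on the C++ side could look like the sketch below (assuming the input binding holds a contiguous buffer of 1x128x64x3 floats; it needs <vector> and <cuda_runtime.h>):

std::vector<float> host_in(1 * 128 * 64 * 3);
// copy the device-side input binding back to the host for inspection
cudaMemcpy(host_in.data(), buffers[inputIndex],
           host_in.size() * sizeof(float), cudaMemcpyDeviceToHost);
// print the first few values and compare them with the Python-side array
for (int i = 0; i < 10; ++i)
    std::cout << host_in[i] << " ";
std::cout << std::endl;
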
Thanks.

Hi. Thanks for your answer. As you can see in my previous post, I have already done this.

Hi,

Sorry for missing the array validation you shared above.

Based on your implementation:

context->execute(1, buffers);
output.download(result);

There is no synchronization mechanism between the GPU tasks and the CPU tasks.
So the CPU may try to copy the buffer back before the inference job is done.

Would you mind adding a synchronization call in between to see if it helps first?

context->execute(1, buffers);
cudaDeviceSynchronize();
output.download(result);
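
If you move to asynchronous execution later, the same idea with an explicit CUDA stream could look like this sketch (enqueueV2 is the asynchronous counterpart of executeV2; the variable names follow the code above):

cudaStream_t stream;
cudaStreamCreate(&stream);
// enqueueV2 launches inference asynchronously on the given stream
context->enqueueV2(buffers, stream, nullptr);
// block until the inference work on this stream has finished
cudaStreamSynchronize(stream);
output.download(result);
cudaStreamDestroy(stream);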

Thanks.