Inference of a model using tensorflow/onnxruntime and TensorRT gives different results

Hi. I have a simple model which I trained using TensorFlow. After that I converted it to ONNX and tried to run inference on my Jetson TX2 with JetPack 4.4.0 using TensorRT, but the results are different.

This is how I run inference with onnxruntime (the model has input [-1, 128, 64, 3] and output [-1, 128]):

import onnxruntime as rt
import cv2 as cv
import numpy as np


sess = rt.InferenceSession("model_tf_float_opset10.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

im = cv.imread('cut10735.png')
im = cv.resize(im, (64, 128))

val_x = []
val_x.append(np.asarray(im).astype(np.float32))

print(output_name, input_name)
try:
    pred = sess.run([output_name], {input_name: val_x})[0]
    print(pred)
except Exception as e:
    print("Unexpected type")
    print("{0}: {1}".format(type(e), e))

This is how I run inference with TensorRT:

IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
  
nvonnxparser::IParser* parser =
        nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
auto config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);

auto profile = builder->createOptimizationProfile();
profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

config->addOptimizationProfile(profile);

ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
int inputIndex = engine->getBindingIndex("images:0");
int outputIndex = engine->getBindingIndex("features:0");


cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
cv::cuda::GpuMat output(128, 1, CV_32F);
cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
gpu_dst.upload(img);
Mat result(128, 1, CV_32F);

void* buffers[2];
buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

IExecutionContext *context = engine->createExecutionContext();
context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
context->executeV2(buffers);
  
output.download(result);
std::cout << result << std::endl;

What can be wrong here?
Here you can find my model and test image:
https://drive.google.com/drive/folders/1xEYcoQwOew-a74c6jpxDJtKu49NNNTgW?usp=sharing

Hi,

May I know which JetPack 4.4 version you are using? The DP or GA release?

There are some TensorRT issues fixed in our TensorRT GA release.
If you are using the DP version, it's recommended to give the latest JetPack a try first.

Thanks.

Hi. Thanks for your answer. I use the latest (GA) release of JetPack.

Thanks for your feedback.

We are going to reproduce this in our environment.
We will update you with more information once we have made some progress.

Thanks.

Hi,

Sorry for the late update.

It looks like you are feeding NHWC-format input into TensorRT.
However, TensorRT uses the NCHW format if no special data format is specified.

Would you mind converting the data into NCHW format and trying it again?
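
For example, a minimal sketch of one possible conversion with OpenCV's dnn module (assuming that module is available in your build) would be:

cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
// blobFromImage converts the 8-bit HWC image into a 1x3x128x64 NCHW float blob
cv::Mat blob = cv::dnn::blobFromImage(img);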

Thanks.

Hi. I have just returned to this task. I converted the image from HWC to CHW, but I still get different results. Have you tried executing my model in your environment? I still can't figure out what I am doing wrong. It looks like the problem is in TensorRT… Here is the fixed code:

void hwc_to_chw(Mat& src, Mat& dst)
{
    const int src_h = src.rows;
    const int src_w = src.cols;
    const int src_c = src.channels();

    cv::Mat hw_c = src.reshape(1, src_h * src_w);
    const std::array<int, 3> dims = {src_c, src_h, src_w};
    dst.create(3, &dims[0], CV_MAKETYPE(src.depth(), 1));
    dst = dst.reshape(1, {src_c, src_h, src_w});
    cv::transpose(hw_c, dst);
}

IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
  
nvonnxparser::IParser* parser =
        nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
auto config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);

auto profile = builder->createOptimizationProfile();
profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

config->addOptimizationProfile(profile);

ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
int inputIndex = engine->getBindingIndex("images:0");
int outputIndex = engine->getBindingIndex("features:0");


cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
Mat chw;
hwc_to_chw(img, chw);
cv::cuda::GpuMat output(128, 1, CV_32F);
cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
gpu_dst.upload(chw);
Mat result(128, 1, CV_32F);

void* buffers[2];
buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

IExecutionContext *context = engine->createExecutionContext();
context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
context->executeV2(buffers);
output.download(result);
std::cout << result << std::endl;

Hi,

Is this issue fixed by the source you shared above?
Or is the output still different?

Thanks.

Hi. No, I just added the HWC to CHW conversion to my code and ran inference with the CHW image format, but the output is still different…

Thanks for your confirmation.
We are checking this issue internally and will update here when we have any news.

Hi,

Could you share a complete C++ source with us so our testing environments are aligned?
Also, we cannot download your model because permission is denied.
Could you help to enable it?

Thanks.

Hi. Here is the complete example for C++:

#include <QDebug>
#include "NvOnnxParser.h"
#include "NvInfer.h"
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <iostream>

using namespace nvinfer1;
using namespace std;
using namespace cv;
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            qDebug() << msg;
    }
} gLogger;

void hwc_to_chw(Mat& src, Mat& dst)
{
    const int src_h = src.rows;
    const int src_w = src.cols;
    const int src_c = src.channels();

    cv::Mat hw_c = src.reshape(1, src_h * src_w);
    const std::array<int, 3> dims = {src_c, src_h, src_w};
    dst.create(3, &dims[0], CV_MAKETYPE(src.depth(), 1));
    dst = dst.reshape(1, {src_c, src_h, src_w});
    cv::transpose(hw_c, dst);
}

int main(int argc, char *argv[])
{
    IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    nvonnxparser::IParser* parser =
            nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model_tf_float_opset10.onnx", (int)ILogger::Severity::kWARNING);
    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);

    auto profile = builder->createOptimizationProfile();
    profile->setDimensions("images:0", OptProfileSelector::kMIN, Dims4{1, 128, 64, 3});
    profile->setDimensions("images:0", OptProfileSelector::kOPT, Dims4{8, 128, 64, 3});
    profile->setDimensions("images:0", OptProfileSelector::kMAX, Dims4{16, 128, 64, 3});

    config->addOptimizationProfile(profile);

    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    int inputIndex = engine->getBindingIndex("images:0");
    int outputIndex = engine->getBindingIndex("features:0");


    cv::Mat img = cv::imread("cut10735.png");
    cv::resize(img, img, cv::Size(64, 128));
    img.convertTo(img, CV_32FC3);
    Mat chw;
    hwc_to_chw(img, chw); // the result is also different without this conversion

    cv::cuda::GpuMat output(128, 1, CV_32F);
    cv::cuda::GpuMat gpu_dst(128, 64, CV_32FC3);
    gpu_dst.upload(chw);
    Mat result(128, 1, CV_32F);

    void* buffers[2];
    buffers[inputIndex] = gpu_dst.data;
    buffers[outputIndex] = output.data;

    IExecutionContext *context = engine->createExecutionContext();
    context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
    context->executeV2(buffers);
    output.download(result);
    std::cout << result << std::endl;
    return 0;
}

And result:

    [0.21741617;
     0.018074907;
     -0.11390054;
     -0.063543327;
     -0.025937777;
     0.035865564;
     -0.076091036;
     0.052894063;
     0.038509559;
     -0.086018324;
     -0.01179905;
     -0.097331896;
     -0.0059410459;
     0.12683025;
     -0.060086627;
     -0.10214382;
     0.11243917;
     0.031299014;
     0.038018111;
     0.10405114;
     0.12848331;
     0.087814629;
     0.0073070363;
     0.089286581;
     -0.0059706182;
     -0.010348215;
     0.048533771;
     0.016377596;
     -0.065473929;
     -0.071357735;
     -0.057700843;
     0.06062549;
     0.13593628;
     -0.089616664;
     0.12405562;
     0.032605886;
     0.14014122;
     0.066228069;
     0.17140938;
     -0.10419387;
     0.060142811;
     0.025047708;
     -0.04850373;
     -0.0051091611;
     -0.037088607;
     0.021987606;
     0.041916661;
     -0.074520893;
     -0.0069955303;
     -0.010275103;
     -0.093412094;
     0.020118032;
     0.013161802;
     -0.03882781;
     0.34909648;
     0.096535429;
     -0.13299593;
     -0.056748956;
     -0.05864013;
     0.18626797;
     -0.10133344;
     0.0037117086;
     0.056481041;
     -0.056401547;
     0.008236954;
     0.027182357;
     -0.086196691;
     -0.0098906467;
     -0.014210807;
     0.10143966;
     -0.018852601;
     -0.094578348;
     -0.082550853;
     -0.078829989;
     0.17038627;
     -0.049729101;
     -0.033981632;
     -0.071885966;
     -0.074963525;
     -0.070453018;
     0.077794887;
     -0.031517465;
     -0.057352651;
     0.24314322;
     -0.098119408;
     -0.11517338;
     0.047104623;
     0.12214255;
     0.042592634;
     -0.049099579;
     -0.0052699912;
     0.15962702;
     0.0061072111;
     -0.025500681;
     0.13695578;
     -0.016284261;
     0.02835446;
     0.0010018852;
     -0.029328998;
     -0.093557015;
     -0.088223211;
     -0.090557031;
     -0.040226765;
     -0.055105787;
     -0.063769467;
     -0.095169932;
     -0.019969737;
     -0.10545529;
     -0.073704861;
     -0.045880266;
     -0.057348069;
     -0.0088410918;
     0.032779064;
     -0.030334083;
     0.082117699;
     -0.054750722;
     0.12931874;
     -0.014893929;
     -0.050810333;
     0.078427881;
     -0.0071323351;
     -0.066322133;
     0.15791741;
     -0.09120097;
     -0.0047501461;
     -0.017844327;
     0.055174693;
     0.25382671]

And here is the complete example for Python:

import onnxruntime as rt
import cv2 as cv
import numpy as np

sess = rt.InferenceSession("model_tf_float_opset10.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

im = cv.imread('D:/test/cut10735.png')
im = cv.resize(im, (64, 128))

val_x = []
val_x.append(np.asarray(im).astype(np.float32))

try:
    pred = sess.run([output_name], {input_name: val_x})[0]
    print(pred)
except Exception as e:
    print("Unexpected type")
    print("{0}: {1}".format(type(e), e))

And result:

[[ 0.01409904  0.04889973  0.0068512  -0.03029526  0.03189908 -0.00803056
   0.08499128  0.17684014  0.04767798  0.03999923  0.10761195 -0.02139558
  -0.07083604  0.21018471 -0.0954253   0.06183483  0.00641072  0.06384496
   0.07886926  0.1793935   0.08014196  0.28557324 -0.00301654  0.0646928
  -0.03479696  0.05344842 -0.07249779 -0.0734601   0.01972523 -0.04917162
   0.17919122 -0.05199863  0.11306548 -0.00073111  0.0602023   0.17280908
   0.10716759  0.06920998  0.07518236 -0.1005758  -0.06912518 -0.05041961
  -0.08174316  0.09229721  0.00539807 -0.00653988 -0.01792016 -0.06297861
  -0.00426906 -0.03353138 -0.07037119 -0.05906641  0.09653544 -0.05743868
   0.0004447  -0.0169575  -0.04731843  0.00805656  0.06733166 -0.05133148
   0.09549342 -0.05889614  0.00104034 -0.00839987 -0.07568271  0.09445982
   0.06090422 -0.07335876 -0.03737445  0.21655427  0.24781345 -0.09691562
  -0.04168833  0.09711373  0.08745644  0.09150615  0.08400141  0.03679254
   0.08579402 -0.03720342  0.08905887 -0.02892478 -0.01852894  0.18236476
   0.16203596 -0.01284258  0.15244164 -0.00778199  0.00091876 -0.0007562
   0.03243485  0.0005487   0.13538565 -0.01859052 -0.02918059 -0.2117243
   0.14415158 -0.10589323 -0.02980787  0.06000031 -0.11880822  0.03605219
   0.06211431  0.15909992  0.07994679  0.09088707  0.10202844 -0.07346293
   0.03759274 -0.11829291 -0.08547547 -0.02942014  0.02702895  0.10723414
  -0.00601167 -0.08887953  0.00301785 -0.02657861  0.13011433  0.01591381
  -0.0574094  -0.0917868  -0.01761637 -0.06247395 -0.02496385  0.025118
  -0.05821754 -0.05967677]]

Hi,

Thanks for the example.
Could you also enable the download permission of the model shared in the original post?

Thanks.

Hi.
Yes, sorry, here is the actual link:
https://drive.google.com/drive/folders/1xEYcoQwOew-a74c6jpxDJtKu49NNNTgW

Hi,

Thanks.
We can reproduce this issue in our environment now.
We will update here when there is any progress.

Hi,

We have confirmed this is an application issue rather than a TensorRT error.

To align the input pre-processing, we rewrote your app with the Python interface.
After that, we get the same output as onnxruntime.

Please check if there is any difference in the OpenCV pre-processing.

trt.py.txt (1.7 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=model_tf_float_opset10.onnx --minShapes=128,64,3  --optShapes=128,64,3 --maxShapes=128,64,3 --dumpOutput --saveEngine=model.trt
$ python3 trt.py

Thanks.

Hi, AastaLLL. Thanks for your reply. I confirm that inference using TensorRT with Python works correctly. But I'm probably blind or stupid, because I still can't find any difference between the C++ code and the Python code, and I am still getting wrong results in C++.

So, what I did:

  1. I built the engine using the trtexec command from your post.
  2. I checked that it gives correct inference results in Python.
  3. I compared the pre-processing in Python and in C++:

C++:

cv::Mat img = cv::imread("cut10735.png");
cv::resize(img, img, cv::Size(64, 128));
img.convertTo(img, CV_32FC3);
cv::cuda::GpuMat gpu_dst(128 * 64 * 3, 1, CV_32F);
gpu_dst.upload(img);

// just for check
gpu_dst.download(img);
std::cout << gpu_dst.cols * gpu_dst.rows * gpu_dst.channels() << "  " << img << std::endl;

So here we have the resized 64x128 float32 input, and cout gives (I provide just the beginning and end of the output):
24576 [91, 92, 53, 90, 91… 50, 63, 52, 49]

Let's compare that with Python:

im = cv2.imread('cut10735.png')
im = cv2.resize(im, (64, 128))
np.copyto(host_inputs[0], im.ravel())
print(np.shape(host_inputs[0]),host_inputs[0].dtype, host_inputs[0])

print output gives:

(24576,) float32 [91. 92. 53. ... 63. 52. 49.]

As I can see, here we also have a float32 input which is the same as in C++.

After that I need to do inference:
C++:

    cv::cuda::GpuMat output(1, 128, CV_32F);
    Mat result;

    void* buffers[2];
    int inputIndex = engine->getBindingIndex("images:0");
    int outputIndex = engine->getBindingIndex("features:0");
    buffers[inputIndex] = gpu_dst.data;
    buffers[outputIndex] = output.data;

    IExecutionContext *context = engine->createExecutionContext();
    context->setBindingDimensions(0, Dims4{1, 128, 64, 3});
    context->execute(1, buffers);
    output.download(result);
    std::cout << result << std::endl;

python:

cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
context.set_binding_shape(0, [1, 128, 64, 3])
context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=1)
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
stream.synchronize()
print("execute times "+str(time.time()-start_time))
output = host_outputs[0].reshape(np.concatenate(([1],engine.get_binding_shape(1))))
print(output)

I don't see any difference here between the C++ and Python code… but the results are different…

Do you have any thoughts?

Hi,

Sorry for the late update.
Would you mind taking a more in-depth look at the two lines below?

buffers[inputIndex] = gpu_dst.data;
buffers[outputIndex] = output.data;

Please note that TensorRT is expected to use the NCHW format for its input and output.
But OpenCV may use the HWC format, which leads to the output difference.

To figure this out, you can print buffers[inputIndex] as a 1D array to ensure all the data are aligned.
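
For example, a minimal check on the C++ side could look like the sketch below (assuming the input binding holds a contiguous buffer of 1x128x64x3 floats; it needs <vector> and <cuda_runtime.h>):

std::vector<float> host_in(1 * 128 * 64 * 3);
// copy the device-side input binding back to the host for inspection
cudaMemcpy(host_in.data(), buffers[inputIndex],
           host_in.size() * sizeof(float), cudaMemcpyDeviceToHost);
// print the first few values and compare them with the Python-side array
for (int i = 0; i < 10; ++i)
    std::cout << host_in[i] << " ";
std::cout << std::endl;
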
Thanks.

Hi. Thanks for your answer. As you can see in my previous post, I have already done this.

Hi,

Sorry for missing the array validation you shared above.

Based on your implementation:

context->execute(1, buffers);
output.download(result);

There is no synchronization mechanism between the GPU tasks and the CPU tasks.
So the CPU may try to copy the buffer back before the inference job is done.

Would you mind adding a synchronization call in between to see if it helps first?

context->execute(1, buffers);
cudaDeviceSynchronize();
output.download(result);
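
If you move to asynchronous execution later, the same idea with an explicit CUDA stream could look like this sketch (enqueueV2 is the asynchronous counterpart of executeV2; the variable names follow the code above):

cudaStream_t stream;
cudaStreamCreate(&stream);
// enqueueV2 launches inference asynchronously on the given stream
context->enqueueV2(buffers, stream, nullptr);
// block until the inference work on this stream has finished
cudaStreamSynchronize(stream);
output.download(result);
cudaStreamDestroy(stream);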

Thanks.