Inference results are partially weird

Description

I referred to wang-xinyu/tensorrtx to create a TensorRT network, and I tried to implement weizheliu/People-Flows using the TensorRT API. After fine-tuning some operators, the program runs fine, but the results are partially weird.

I tested two (1, 3, 540, 960) images under PyTorch and TensorRT respectively. Both runtimes produce the same output shape, (1, 10, 67, 120). I then split the output by channel and summed each channel to check the counting result.

  • In PyTorch, results are
    [2.58848, 12.08908, 13.08177, 11.02262, 0.03745,
    0.03239, 0.02941, 0.09200, 0.02342, 7.01662]
  • In TensorRT, results are
    [2.31062, 13.39710, 12.2439, 8.91591, 0.01363,
    0.03388, 0.02145, 0.04163, 2.37593, -nan]
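The per-channel sums above can also be computed directly on the raw output buffer, without going through OpenCV, which helps separate a network problem from a buffer-slicing problem. This is only a minimal sketch: channel_sums is a hypothetical helper, and it assumes the output is a contiguous float buffer in CHW order.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the original code): sums each channel of a
// contiguous CHW float buffer. `plane` is the number of elements per channel,
// e.g. 67 * 120 for the output shape discussed here.
std::vector<double> channel_sums(const float* chw, std::size_t channels,
                                 std::size_t plane) {
    std::vector<double> sums;
    sums.reserve(channels);
    for (std::size_t c = 0; c < channels; ++c) {
        const float* begin = chw + c * plane;  // start of channel c
        // Accumulate in double to reduce rounding error over many elements.
        sums.push_back(std::accumulate(begin, begin + plane, 0.0));
    }
    return sums;
}
```

If these sums match PyTorch but the cv::Mat sums do not, the slicing into Mats is the likely culprit rather than the network.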

The last number in TensorRT changes on every inference: it may be very large, very small, or nan. It does not seem to be a problem with the network definition in the TensorRT API. Could it be a problem with how I bind the result pointer to the vector of matrices?

In TensorRT, I use OpenCV Mats to store the result per channel, like this:

void* result_blob = malloc(outputSize);
cudaMemcpyAsync(result_blob, buffers[2], outputSize, cudaMemcpyDeviceToHost, stream);

...

std::vector<cv::Mat> chw_output;
for (int i = 0; i < 10; i++)
{
	chw_output.emplace_back(cv::Mat(image.rows / 8, image.cols / 8, CV_32FC1,
		(float*)result_blob + i * image.size().height/8 * image.size().width/8));
}

I give each cv::Mat its slice of the result memory, using the result pointer and the single-channel size.

What should I do to get the right cv::Mat?

Environment

TensorRT Version: 7.2.3.4
GPU Type: RTX 2070 8G
Nvidia Driver Version: 510.06
CUDA Version: 11.1
CUDNN Version: 8.1
Operating System + Version: Windows11 22000.282
Python Version (if applicable): 3.9.4
PyTorch Version (if applicable): 1.8.1+cu111
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Steps To Reproduce

Build with VS2019 x64 Release: no errors, returns 0.

Hi,
Can you try running your model with the trtexec command and share the "--verbose" log if the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of all supported operators. If any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if not shared already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the links below:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#error-messaging
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#faq

Thanks!

Can you try running your model with the trtexec command and share the "--verbose" log if the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

This model has an adaptive pooling layer, so I gave up on the onnx-tensorrt path. I also tried NVIDIA-AI-IOT/torch2trt, but the output shape did not match the expected shape. I finally chose to define the network with the TensorRT API, as in wang-xinyu/tensorrtx. I defined the network in CANet2s.cpp and tested TensorRT in the same file.

Also, please share your model and script, if not shared already, so that we can help you better.

I have put together the relevant model, code, and scripts. I am not sure if this is enough to express my problem.
File on my OneDrive: Relate.zip

The main issue in this topic has been solved. I carefully reviewed my entire code and found some logic bugs, such as in the adaptive pooling in the network definition and in the standardization process. I also used the Python API to load the existing plan file and ran the same inferences. I found that the network I defined has a bug, and that the way I put the result into the vector of cv::Mat also had a problem. The former is still being explored, but the latter I have solved. It is just a small logic bug, and it can be solved with a simple type conversion:

std::vector<cv::Mat> chw_output;
for (int i = 0; i < 10; i++)
{
	chw_output.emplace_back(cv::Mat(image.rows / 8, image.cols / 8, CV_32FC1,
		(float*)result_blob + i * int(image.size().height/8) * int(image.size().width/8)));
}
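
A note on why the cast seems to help: as far as C++ evaluation order goes, `i * h/8 * w/8` is computed left to right as `((i * h) / 8 * w) / 8`, with integer truncation at each division, whereas `int(h/8)` and `int(w/8)` force each dimension to be divided first. The sketch below checks this for the 540 x 960 input from this thread. The function and constant names are made up for illustration and are not part of the original code.

```cpp
#include <cstddef>

// Offset as the first version of the loop computes it:
// ((i * height) / 8 * width) / 8, truncating at each division.
constexpr std::size_t offset_as_written(int i, int height, int width) {
    return static_cast<std::size_t>(i * height / 8 * width / 8);
}

// Offset with each dimension divided first: i * (height / 8) * (width / 8).
constexpr std::size_t offset_per_plane(int i, int height, int width) {
    return static_cast<std::size_t>(i) * (height / 8) * (width / 8);
}

constexpr int kHeight = 540, kWidth = 960, kChannels = 10;
constexpr std::size_t kPlane = (kHeight / 8) * (kWidth / 8);  // 67 * 120 = 8040
constexpr std::size_t kTotal = kChannels * kPlane;            // 80400 floats

// Channel 1 happens to agree, so the first channels look plausible.
static_assert(offset_as_written(1, kHeight, kWidth) == 8040, "channel 1");
// From channel 2 on the offsets drift apart.
static_assert(offset_as_written(2, kHeight, kWidth) == 16200, "channel 2, as written");
static_assert(offset_per_plane(2, kHeight, kWidth) == 16080, "channel 2, per plane");
// Channel 9 starts 480 floats too late and so reads past the end of the buffer.
static_assert(offset_as_written(9, kHeight, kWidth) == 72840, "channel 9, as written");
static_assert(offset_per_plane(9, kHeight, kWidth) == 72360, "channel 9, per plane");
static_assert(offset_as_written(9, kHeight, kWidth) + kPlane > kTotal, "overrun");
```

If this reading is right, the last Mat in the first version covered 480 floats beyond the output buffer, which would be consistent with the last sum changing on every run. Computing the plane size once (rows / 8 times cols / 8) and multiplying it by i would avoid the problem without relying on casts.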