TensorRT C++ not working like the Python version and gives wrong results

Description

I converted TensorFlow weights to an ONNX model and ran it with TensorRT in C++, but the C++ results did not match the Python ones. I checked the Python version using TensorRT, and it produced correct results. You can verify this yourself: I have put both the Python and C++ code in a GitHub repository, together with the weights and steps to reproduce the issue. I have made the example as simple as possible, so I think the issue is probably in TensorRT.

Environment

TensorRT Version: TensorRT-8.4.3.1 (two other versions also tested)
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 536.67
CUDA Version: 11.5
CUDNN Version: cudnn-windows-x86_64-8.6.0.163_cuda11
Operating System + Version: Windows 10
Python Version (if applicable): 3.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Hi,

We recommend that you try the latest TensorRT version, 8.6.1, and let us know if you still face the same issue. Also, please make sure your inference script is correct, and if you are doing any post-processing, verify that it is correct as well.

Thank you.

@spolisetty I appreciate your response. For your information, I have already tried multiple versions, including TensorRT-8.6.1.6 and 8.2. Furthermore, to make testing straightforward, I have simplified the script, eliminating all pre-processing and post-processing. The input binding is filled with a constant value to make sure we are not doing anything differently between Python and C++.

python input data:

    # Data feed: assume every image pixel after preprocessing is -0.99609375
    image = np.ones((1, 112, 112, 3), np.float32) * -0.99609375

C++ Input data:

    // Allocate memory on the GPU for the input and output data
    float *input_data;
    float *output_data;
    int input_size = 1 * 112 * 112 * 3; // matches the Python input shape (1, 112, 112, 3)
    int output_size = 512; // size of the model's output tensor
    cudaMalloc((void **) &input_data, input_size * sizeof(float));
    cudaMalloc((void **) &output_data, output_size * sizeof(float));

    // Set the input data
    float input_value = -0.99609375;
    cudaMemset(input_data, input_value, input_size * sizeof(float));

Full C++ code:

#include <iostream>
#include <fstream>
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include "NvOnnxParser.h"

using namespace nvinfer1;

// Simple Logger for TensorRT
class Logger : public nvinfer1::ILogger {
public:
    void log(Severity severity, const char *msg) noexcept override {
        // Suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    std::string engine_file = "Example.engine";

    // Create a TensorRT runtime
    IRuntime *runtime = createInferRuntime(gLogger);

    // Read the engine file
    std::ifstream engineStream(engine_file, std::ios::binary);
    std::string engineString((std::istreambuf_iterator<char>(engineStream)), std::istreambuf_iterator<char>());
    engineStream.close();

    // Deserialize the engine
    ICudaEngine *engine = runtime->deserializeCudaEngine(engineString.data(), engineString.size());

    // Create an execution context
    IExecutionContext *context = engine->createExecutionContext();

    // Allocate memory on the GPU for the input and output data
    float *input_data;
    float *output_data;
    int input_size = 1 * 112 * 112 * 3;
    int output_size = 512;
    cudaMalloc((void **) &input_data, input_size * sizeof(float));
    cudaMalloc((void **) &output_data, output_size * sizeof(float));

    // Set the input data
    float input_value = -0.99609375;
    cudaMemset(input_data, input_value, input_size * sizeof(float));
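    // NOTE: see the resolution at the end of the thread; cudaMemset writes a
    // single byte value, so this line does not fill the buffer with the float
    // -0.99609375 (it zero-fills it instead).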

    // Set up the execution bindings
    void *bindings[2] = {input_data, output_data};

    // Run inference
    context->executeV2(bindings);

    // Copy the output data back to the host
    float *host_output = new float[output_size];
    cudaMemcpy(host_output, output_data, output_size * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the output data
    for (int i = 0; i < 10; ++i) {
        std::cout << host_output[i] << "\n";
    }
    std::cout << std::endl;

    // Clean up
    cudaFree(input_data);
    cudaFree(output_data);
    delete[] host_output;
    context->destroy();
    engine->destroy();
    runtime->destroy();

    return 0;
}

Using cudaMemset here was the mistake: cudaMemset sets every byte of the buffer to a single byte value, so passing the float -0.99609375 (which converts to the int 0) zero-fills the input instead of filling it with -0.99609375. After replacing it with a proper fill, the C++ code generates the same results as Python.
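For anyone hitting the same problem, here is a minimal sketch of a correct fill, assuming the same input_data, input_size, and input_value as in the code above: build the constant input on the host and copy it to the device with cudaMemcpy.

    // Requires #include <vector> at the top of the file.
    // Fill a host buffer with the constant, then copy the raw float values to
    // the device. cudaMemcpy transfers the bit patterns intact, whereas
    // cudaMemset writes a single byte value across the whole buffer.
    std::vector<float> host_input(input_size, input_value);
    cudaMemcpy(input_data, host_input.data(), input_size * sizeof(float),
               cudaMemcpyHostToDevice);

A thrust::fill over a thrust::device_ptr<float> (or a trivial fill kernel) would avoid the host round trip, but for a one-time constant input the extra copy is negligible.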


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.