Python-serialized TensorRT engine outputs wrong data in the TensorRT C++ runtime

Description

Hi! I have referred to this and here is my case:
I used torch2trt to convert my PyTorch model (ERFNet) to a TensorRT engine.
Since the TensorRT API used by torch2trt requires version 5 or higher, I modified some of the API calls to fit my TensorRT 4 installation, because that version is compatible with my DRIVE PX2 (DriveWorks 1.2, CUDA 9.2), to which I want to port the engine.
I think I converted the model successfully; below is the model architecture that torch2trt printed out.

torch.Tensor.get_device
torch.nn.Conv2d.forward
torch.nn.functional.max_pool2d
torch.cat
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.max_pool2d
torch.cat
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.max_pool2d
torch.cat
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.Dropout2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.ConvTranspose2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.ConvTranspose2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.functional.relu
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.Tensor.__add__
torch.nn.functional.relu
torch.nn.ConvTranspose2d.forward
torch.nn.Conv2d.forward
torch.nn.BatchNorm2d.forward
torch.nn.functional.relu
torch.nn.Dropout2d.forward
torch.nn.Conv2d.forward
torch.nn.functional.softmax
torch.nn.functional.max_pool2d
torch.Tensor.view
torch.nn.Linear.forward
torch.nn.functional.relu
torch.nn.Linear.forward
torch.nn.functional.sigmoid

The model’s input size is (1 * 3 * 208 * 976) and the output size is (1 * 5 * 202 * 970 + 4).
I use tensorrt.utils.write_engine_to_file to save my serialized engine.
After generating the engine, I want to deserialize it and run inference in C++.
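For context, the `data` and `length` passed to `deserializeCudaEngine` below come from reading the serialized engine file into host memory. The post does not show that step, so here is a minimal sketch; the helper name `readEngineFile` and the exact signature are my own assumptions, not the original code:

```cpp
#include <fstream>
#include <memory>
#include <string>

// Read a serialized engine file into a byte buffer.
// Returns the buffer (or nullptr on failure) and stores its size in `length`.
std::unique_ptr<char[]> readEngineFile(const std::string& path, size_t& length)
{
    // Open at the end so tellg() immediately gives us the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        return nullptr;  // file missing or unreadable
    length = static_cast<size_t>(file.tellg());
    file.seekg(0, std::ios::beg);
    std::unique_ptr<char[]> data(new char[length]);
    file.read(data.get(), static_cast<std::streamsize>(length));
    return data;
}
```

The result would then feed the deserialization call, e.g. `size_t length = 0; auto data = readEngineFile("ERFNet_trt.engine", length);` before `deserializeCudaEngine(data.get(), length, nullptr)`.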

std::cout << "*** Deserializing ***" << std::endl;
mTrtRunTime = createInferRuntime(gLogger);
assert(mTrtRunTime != nullptr);
mTrtEngine = mTrtRunTime->deserializeCudaEngine(data.get(), length, nullptr);
assert(mTrtEngine != nullptr);

std::cout << "*** Initialize the engine ***" << std::endl;        
const int maxBatchSize = 1;
mTrtContext = mTrtEngine->createExecutionContext();
assert(mTrtContext != nullptr);
mTrtContext->setProfiler(&mTrtProfiler);

// Allocate one buffer per binding: exactly IEngine::getNbBindings() of them.
int nbBindings = mTrtEngine->getNbBindings();
mTrtCudaBuffer.resize(nbBindings);
mTrtBindBufferSize.resize(nbBindings);
for (int i = 0; i < nbBindings; ++i)
{
    Dims dims = mTrtEngine->getBindingDimensions(i);
    DataType dtype = mTrtEngine->getBindingDataType(i);
    int64_t totalSize = volume(dims) * maxBatchSize * getElementSize(dtype);
    mTrtBindBufferSize[i] = totalSize;
    mTrtCudaBuffer[i] = safeCudaMalloc(totalSize);
    if(mTrtEngine->bindingIsInput(i))
        mTrtInputCount++;
}
CUDA_CHECK(cudaStreamCreate(&mTrtCudaStream));        
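The snippet above relies on two small helpers, `volume` and `getElementSize`, that the post does not show. A self-contained sketch of their usual shape follows; note it uses stand-in `Dims`/`DataType` definitions instead of the real `nvinfer1` types, which I am assuming the original code uses:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for nvinfer1::Dims and nvinfer1::DataType (assumption: the
// original code uses the TensorRT types, which carry the same information).
struct Dims { int nbDims; int d[8]; };
enum class DataType { kFLOAT, kHALF, kINT8, kINT32 };

// Number of elements in a binding: the product of all its dimensions.
int64_t volume(const Dims& dims)
{
    int64_t v = 1;
    for (int i = 0; i < dims.nbDims; ++i)
        v *= dims.d[i];
    return v;
}

// Bytes per element for each supported data type.
size_t getElementSize(DataType t)
{
    switch (t)
    {
    case DataType::kFLOAT: return 4;
    case DataType::kHALF:  return 2;
    case DataType::kINT8:  return 1;
    case DataType::kINT32: return 4;
    }
    return 0;
}
```

With these, the binding size computed above for the (3, 208, 976) input is `volume * maxBatchSize * elementSize` bytes, which is what gets passed to `cudaMalloc` and later to `cudaMemcpyAsync`.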

Up to this point, no error occurred.
Then I feed a random float vector into the model and run inference, but the output is all 0 or NaN. I have already checked the data type, and the output still never shows a normal number.

vector<float> inputData(h * w * c); // 208 * 976 * 3
std::generate(inputData.begin(), inputData.end(), []() {
    return float(rand() % 255);
});
vector<float> outputData;
outputData.resize(net.getOutputSize()/sizeof(float));

std::cout << "*** Inference ***" << std::endl;
static const int batchSize = 1;
assert(mTrtInputCount == 1);

int inputIndex = 0;
CUDA_CHECK(cudaMemcpyAsync(mTrtCudaBuffer[inputIndex], inputData.data(), mTrtBindBufferSize[inputIndex], cudaMemcpyHostToDevice, mTrtCudaStream));
auto t_start = std::chrono::high_resolution_clock::now();
mTrtContext->execute(batchSize, &mTrtCudaBuffer[inputIndex]);
auto t_end = std::chrono::high_resolution_clock::now();

float total = std::chrono::duration<float, std::milli>(t_end - t_start).count();
std::cout << "Time taken for inference is " << total << " ms." << std::endl;

char* outPtr = reinterpret_cast<char*>(outputData.data());
for (size_t bindingIdx = mTrtInputCount; bindingIdx < mTrtBindBufferSize.size(); ++bindingIdx) {
    auto size = mTrtBindBufferSize[bindingIdx];
    CUDA_CHECK(cudaMemcpyAsync(outPtr, mTrtCudaBuffer[bindingIdx], size, cudaMemcpyDeviceToHost, mTrtCudaStream));
    outPtr += size;  // advance into the host buffer for the next binding
}
CUDA_CHECK(cudaStreamSynchronize(mTrtCudaStream));  // wait for the async copies before reading outputData
mTrtIterationTime++;

How can I solve this kind of problem? Does it have something to do with the TensorRT version? And how can I verify that the generated engine really works?
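One quick sanity check on "does the engine really work" is to scan the host output buffer and measure the fraction of NaN, Inf, and exact-zero values: if nearly every element falls in those buckets, the failure is almost certainly in the pipeline (binding order, buffer sizes, input layout) rather than in the weights. A minimal sketch (the helper name `badFraction` is my own, not from the post):

```cpp
#include <cmath>
#include <vector>

// Fraction of pathological values (NaN, Inf, or exactly 0) in an
// inference output buffer. Close to 1.0 means the output is garbage.
double badFraction(const std::vector<float>& out)
{
    if (out.empty())
        return 1.0;
    size_t bad = 0;
    for (float v : out)
        if (std::isnan(v) || std::isinf(v) || v == 0.0f)
            ++bad;
    return static_cast<double>(bad) / static_cast<double>(out.size());
}
```

Used right after the device-to-host copy, e.g. `if (badFraction(outputData) > 0.99) { /* pipeline problem, not accuracy */ }`, this separates "engine produces nothing sensible" from "engine produces slightly wrong numbers".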

Environment

TensorRT Version: 4.0.1.6
GPU Type: GeForce GTX 1070
Nvidia Driver Version: 396.26
CUDA Version: 9.2
CUDNN Version: 7
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 2.7
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.0.0
Baremetal or Container (if container which image + tag):

Relevant Files

ERFNet_trt.engine

Steps To Reproduce


Hi,

TRT 4.0 is a very old version; we recommend using the latest TRT version supported on your device.
To verify the model, you can use the trtexec command-line tool:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt_401/tensorrt-developer-guide/index.html#giexec

Thanks

Dear @SunilJB:
Thanks for your reply.
But here I have a few questions.
According to the documentation, CUDA 9.2 is no longer supported by TensorRT 5 or later; it seems that only CUDA 10.x is supported. However, my DRIVE PX2 runs DriveWorks 1.2 with CUDA 9.2. Is it possible to deploy an engine generated with the TensorRT 5 API on it? Thanks!

Hi,

I don’t have much knowledge of the DRIVE PX2 platform; I recommend posting your query in the forum below so that the DRIVE PX2 team can take a look:
https://forums.developer.nvidia.com/c/agx-autonomous-machines/drive-px2/61

Thanks

Dear @SunilJB :
OK! I will post my question over there.
Thanks for your help!