Misaligned Address on repetitive run of IExecutionContext

Description

Hello,

I am trying to run a TensorRT engine on a video on a Jetson AGX platform. I used one of your sample codes to build the engine and run inference on a single image, and that works fine. When I started calling the infer method repeatedly, the overall time spent in the code was huge. The reason was that I was creating the execution context each time I ran the engine.

infer()
{
    samplesCommon::BufferManager buffers(this->mEngine, mParams.batchSize);
    auto context = std::shared_ptr<nvinfer1::IExecutionContext>(this->mEngine->createExecutionContext(), samplesCommon::InferDeleter());
    buffers.copyInputToDevice();
    bool status = context->executeV2(buffers.getDeviceBindings().data());
    buffers.copyOutputToHost();
}

and so on.

This works for a single image. But createExecutionContext() on the engine takes 20-30 milliseconds, so I tried to create the context outside the infer method. A cascade of failures led me to this test code:

samplesCommon::BufferManager buffers(this->mEngine, mParams.batchSize);
auto context = std::shared_ptr<nvinfer1::IExecutionContext>(this->mEngine->createExecutionContext(), samplesCommon::InferDeleter());
buffers.copyInputToDevice(); // I also tried this inside the loop, result = same
while (1)
{
    std::cout << "beginning" << std::endl;
    bool status = context->executeV2(buffers.getDeviceBindings().data());
    buffers.copyOutputToHost();
    std::cout << "ending" << std::endl;
}
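For readers following along, here is a minimal sketch of the ownership pattern I was aiming for: create the context once, keep it alive, and call it once per frame. FakeEngine, FakeContext and Inferencer are hypothetical stand-ins so the snippet compiles without TensorRT; they are not the TensorRT API. It only illustrates the lifetime of the context and does not reproduce or fix the misaligned address error.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for nvinfer1::IExecutionContext (NOT the TensorRT API).
struct FakeContext {
    int runs = 0;
    // Same shape as the call used above: takes the array of device bindings.
    bool executeV2(void** /*bindings*/) { ++runs; return true; }
};

// Hypothetical stand-in for nvinfer1::ICudaEngine; counts how many contexts it created.
struct FakeEngine {
    int contextsCreated = 0;
    FakeContext* createExecutionContext() {
        ++contextsCreated;
        return new FakeContext();
    }
};

// Hypothetical wrapper: the context is created once in the constructor
// and reused by every infer() call, so the creation cost is paid only once.
class Inferencer {
public:
    explicit Inferencer(FakeEngine& engine)
        : mContext(engine.createExecutionContext()) {}

    // One inference pass with the long-lived context.
    bool infer(std::vector<void*>& bindings) {
        return mContext->executeV2(bindings.data());
    }

    int runs() const { return mContext->runs; }

private:
    std::unique_ptr<FakeContext> mContext;
};
```

With real TensorRT types the member would be the smart pointer built from createExecutionContext() as in my snippet above, and the buffers would have to outlive the loop in the same way.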

And then all hell broke loose after the first iteration. The first execution passes and both the beginning and ending prints come through; the errors start flowing on the second execution of the context.

[01/04/2022-08:52:53] [E] [TRT] …/rtExt/cuda/pointwiseV2Helpers.h (538) - Cuda Error in launchPwgenKernel: 716 (misaligned address)
[01/04/2022-08:52:53] [E] [TRT] FAILED_EXECUTION: std::exception
[01/04/2022-08:52:53] [E] [TRT] engine.cpp (179) - Cuda Error in ~ExecutionContext: 716 (misaligned address)
[01/04/2022-08:52:53] [E] [TRT] INTERNAL_ERROR: std::exception
[01/04/2022-08:52:53] [E] [TRT] Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::155, condition: cudnnDestroy(context.cudnn) failure.
[01/04/2022-08:52:53] [E] [TRT] Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
[01/04/2022-08:52:53] [E] [TRT] Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
[01/04/2022-08:52:53] [E] [TRT] …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 716 (misaligned address)

I tried using cudaDeviceSynchronize(); from another suggestion on a similar topic, it changed nothing. Tried unique_ptr and shared_ptr on context, didn’t change a thing. So my questions are:

  1. Is it not possible to use a context more than once?
  2. What is the solution to this situation? The example Python codes seem to pass the context from method to method over and over and they seem to work fine. So what makes the C++ code any different?

Please do not copy paste me links of educational sources, they don’t solve my problems, ever.

Thanks in advance,
Cem

Environment

TensorRT Version: 7.1.3
GPU Type: Jetson AGX
CUDA Version: 10.2
Operating System + Version: Ubuntu 18.04

Hi,
Please refer to the link below for the sample guide.

Refer to the installation steps from the link in case you are missing anything.

However, the suggested approach is to use TRT NGC containers to avoid any system-dependency-related issues.

To run the Python samples, make sure the TRT Python packages are installed when using an NGC container:
/opt/tensorrt/python/python_setup.sh

If you are trying to run a custom model, please share your model and script with us so that we can assist you better.
Thanks!

Hello @NVES

Is there an official customer support where I can file cases for my problems? This forum is going to give me cancer.

The question is really simple: why does it fail when I run IExecutionContext->executeV2 two times?

copyDataToDevice
executeV2
copyDataFromDevice

copySameDataToDevice
executeV2
→ Fails here

executeV2() has some problems to it. Devs need to make the code less susceptible to parameter errors.

execute() was giving an error about batchSize == 0 || batchSize <= Engine->getMaxBatchSize.

I removed the builder->setMaxBatchSize() call from the builder and built the engine again.

Now execute() works, and so does executeV2().

You can remove that call from the example code at this line: yolov4_deepstream/SampleYolo.cpp at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub

Or make batchNumber 1 instead of 0 at this line: yolov4_deepstream/main.cpp at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub
But that will make this line problematic, because Common/BufferManager will fail at this assertion: assert(engine->hasImplicitBatchDimension() || mBatchSize == 0); yolov4_deepstream/SampleYolo.cpp at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub
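
To make the conflict concrete, the BufferManager assertion quoted above can be read as a small predicate. This is only a sketch: bufferManagerAccepts is a hypothetical helper name, and the body simply mirrors the quoted condition rather than anything else in the sample.

```cpp
// Hypothetical helper mirroring the assertion quoted above:
//   assert(engine->hasImplicitBatchDimension() || mBatchSize == 0);
// Returns true when BufferManager would accept the given batch size.
bool bufferManagerAccepts(bool hasImplicitBatchDimension, int batchSize) {
    return hasImplicitBatchDimension || batchSize == 0;
}
```

Read this way, an engine without an implicit batch dimension only passes the assertion with a batch size of 0, which is why changing batchNumber to 1 trips it.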

If this repo does not belong to Nvidia please file a complaint, they carry your logo and address, if it belongs to Nvidia please moderate.

You are welcome,
Cem