TensorRT affects memory access speed

While working on an implementation of YoloV4 using TensorRT, I noticed that the per-iteration runtime (image preprocessing, inference, and image post-processing) differs between a run where the engine is built from the ONNX model and a run where a serialized engine is loaded. When loading the engine, the runtime was slower than when the engine was built from scratch. Comparing the individual steps, the preprocessing step was 2 to 3 times slower when the engine was loaded than when it was built. After more testing, I found that simply calling nvinfer1::createInferBuilder reduces the preprocessing runtime by the previously mentioned factor of 2 to 3. It appears that createInferBuilder performs some action that affects how memory is accessed, thereby reducing the runtime.

I have included the test application I wrote to showcase the described behavior, as well as the output of the two tests. The first test executes 10 iterations of the image preprocessing step before createInferBuilder has been called; the second test runs after it has been called.

Can someone explain this behavior, and describe how to achieve the same speed-up without calling createInferBuilder?

Test Application
#include <chrono>  // std::chrono, used by Timer
#include <cstdint> // uint32_t
#include <cstdio>  // getchar
#include <iostream>
#include <vector>

#include <NvInfer.h>
#include <opencv2/opencv.hpp>

class Logger : public nvinfer1::ILogger
{
	void log(Severity severity, const char *msg) override
	{
		// suppress info and verbose messages
		if (severity != Severity::kINFO && severity != Severity::kVERBOSE)
			std::cout << msg << std::endl;
	}
} gLogger;

class Timer
{
	using Clock = std::chrono::high_resolution_clock;

public:
	void Start()
	{
		m_stopped   = false;
		m_StartTime = Clock::now();
	}

	void Stop()
	{
		m_stopped = true;
		m_EndTime = Clock::now();
	}

	double GetElapsedTimeInMilliSec() const
	{
		return std::chrono::duration_cast<std::chrono::microseconds>(getTimeDiff()).count() * 1.0e-3;
	}

	friend std::ostream &operator<<(std::ostream &stream, const Timer &t)
	{
		stream << t.GetElapsedTimeInMilliSec() << " ms";
		return stream;
	}

private:
	Clock::duration getTimeDiff() const
	{
		if (!m_stopped)
			return (Clock::now() - m_StartTime);
		else
			return (m_EndTime - m_StartTime);
	}

private:
	bool m_stopped = false;

	Clock::time_point m_StartTime;
	Clock::time_point m_EndTime;
};

std::vector<float> test(const cv::Mat &image)
{
	Timer t1;
	Timer t11;
	Timer t12;
	Timer t13;
	Timer t2;
	Timer t21;
	Timer t22;
	Timer t3;

	t3.Start();
	t1.Start();
	// ==== Pre-Process Image ====
	std::vector<float> data;
	cv::Mat scaledImg;
	cv::Mat imgConv;
	cv::Mat imgConv2;
	t11.Start();
	cv::resize(image, scaledImg, cv::Size(608, 608));
	t11.Stop();

	t12.Start();
	cv::cvtColor(scaledImg, imgConv, cv::COLOR_BGR2RGB);
	t12.Stop();
	t13.Start();
	imgConv.convertTo(imgConv2, CV_32FC3, 1 / 255.0);
	t13.Stop();

	t1.Stop();

	t2.Start();
	std::vector<cv::Mat> channels(3);
	t21.Start();
	cv::split(imgConv2, channels);
	t21.Stop();
	float *ptr1 = reinterpret_cast<float *>(channels[0].data);
	float *ptr2 = reinterpret_cast<float *>(channels[1].data);
	float *ptr3 = reinterpret_cast<float *>(channels[2].data);
	t22.Start();
	data.insert(data.end(), ptr1, ptr1 + 608 * 608);
	data.insert(data.end(), ptr2, ptr2 + 608 * 608);
	data.insert(data.end(), ptr3, ptr3 + 608 * 608);
	t22.Stop();
	t22.Stop();

	t2.Stop();
	t3.Stop();
	// ==== Pre-Process Image ====

	std::cout << "T1: " << t1 << " | T1.1: " << t11 << " | T1.2: " << t12 << " | T1.3: " << t13 << " | T2: " << t2 << " | T2.1: " << t21 << " | T2.2: " << t22 << " | T3: " << t3 << std::endl;

	return data;
}

#define ITERATIONS 10

int main(int argc, char **argv)
{
	std::vector<std::vector<float>> dd;
	cv::Mat img = cv::imread("dog.jpg");

	for (uint32_t i = 0; i < ITERATIONS; i++)
		dd.push_back(test(img));

	std::cout << "Test 1 DONE" << std::endl;
	getchar();

	nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(gLogger);
	for (uint32_t i = 0; i < ITERATIONS; i++)
		dd.push_back(test(img));

	std::cout << "Test 2 DONE" << std::endl;
	getchar();
	builder->destroy(); // release the builder; it is intentionally kept alive during Test 2 above

	// Prevent the compiler from optimizing out the results
	for (const std::vector<float> &a : dd)
	{
		for (const float &elem : a)
			std::cout << elem << std::endl;
	}

	return 0;
}

Test 1 Output

T1: 3.241 ms | T1.1: 1.837 ms | T1.2: 0.569 ms | T1.3: 0.834 ms | T2: 8.18 ms | T2.1: 4.233 ms | T2.2: 3.941 ms | T3: 11.422 ms
T1: 3.978 ms | T1.1: 1.158 ms | T1.2: 0.443 ms | T1.3: 2.374 ms | T2: 7.918 ms | T2.1: 4.152 ms | T2.2: 3.764 ms | T3: 11.897 ms
T1: 1.268 ms | T1.1: 0.741 ms | T1.2: 0.122 ms | T1.3: 0.403 ms | T2: 7.658 ms | T2.1: 4.04 ms | T2.2: 3.615 ms | T3: 8.927 ms
T1: 1.296 ms | T1.1: 0.777 ms | T1.2: 0.12 ms | T1.3: 0.398 ms | T2: 9.16 ms | T2.1: 4.045 ms | T2.2: 5.112 ms | T3: 10.458 ms
T1: 1.195 ms | T1.1: 0.697 ms | T1.2: 0.114 ms | T1.3: 0.381 ms | T2: 8.741 ms | T2.1: 4.056 ms | T2.2: 4.682 ms | T3: 9.937 ms
T1: 1.485 ms | T1.1: 0.847 ms | T1.2: 0.254 ms | T1.3: 0.383 ms | T2: 8.491 ms | T2.1: 4.094 ms | T2.2: 4.394 ms | T3: 9.977 ms
T1: 1.324 ms | T1.1: 0.743 ms | T1.2: 0.156 ms | T1.3: 0.423 ms | T2: 8.116 ms | T2.1: 4.022 ms | T2.2: 4.091 ms | T3: 9.441 ms
T1: 1.316 ms | T1.1: 0.736 ms | T1.2: 0.196 ms | T1.3: 0.382 ms | T2: 7.859 ms | T2.1: 4.053 ms | T2.2: 3.803 ms | T3: 9.177 ms
T1: 1.222 ms | T1.1: 0.614 ms | T1.2: 0.204 ms | T1.3: 0.401 ms | T2: 7.55 ms | T2.1: 4.055 ms | T2.2: 3.493 ms | T3: 8.773 ms
T1: 1.201 ms | T1.1: 0.679 ms | T1.2: 0.139 ms | T1.3: 0.382 ms | T2: 9.291 ms | T2.1: 4.11 ms | T2.2: 5.179 ms | T3: 10.493 ms

Test 2 Output

T1: 7.958 ms | T1.1: 3.812 ms | T1.2: 1.927 ms | T1.3: 2.216 ms | T2: 8.908 ms | T2.1: 3.849 ms | T2.2: 5.055 ms | T3: 16.867 ms
T1: 2.599 ms | T1.1: 1.809 ms | T1.2: 0.332 ms | T1.3: 0.455 ms | T2: 2.536 ms | T2.1: 0.538 ms | T2.2: 1.995 ms | T3: 5.135 ms
T1: 1.832 ms | T1.1: 1.032 ms | T1.2: 0.373 ms | T1.3: 0.425 ms | T2: 2.789 ms | T2.1: 0.513 ms | T2.2: 2.274 ms | T3: 4.623 ms
T1: 1.512 ms | T1.1: 0.79 ms | T1.2: 0.319 ms | T1.3: 0.4 ms | T2: 3.012 ms | T2.1: 0.604 ms | T2.2: 2.406 ms | T3: 4.525 ms
T1: 1.812 ms | T1.1: 0.891 ms | T1.2: 0.472 ms | T1.3: 0.447 ms | T2: 1.903 ms | T2.1: 0.589 ms | T2.2: 1.311 ms | T3: 3.716 ms
T1: 2.059 ms | T1.1: 1.404 ms | T1.2: 0.263 ms | T1.3: 0.39 ms | T2: 2.014 ms | T2.1: 0.548 ms | T2.2: 1.463 ms | T3: 4.074 ms
T1: 1.935 ms | T1.1: 1.165 ms | T1.2: 0.378 ms | T1.3: 0.39 ms | T2: 2.283 ms | T2.1: 0.546 ms | T2.2: 1.735 ms | T3: 4.22 ms
T1: 2.448 ms | T1.1: 0.874 ms | T1.2: 1.165 ms | T1.3: 0.406 ms | T2: 2.614 ms | T2.1: 0.544 ms | T2.2: 2.066 ms | T3: 5.062 ms
T1: 1.824 ms | T1.1: 1.067 ms | T1.2: 0.377 ms | T1.3: 0.378 ms | T2: 2.903 ms | T2.1: 0.596 ms | T2.2: 2.305 ms | T3: 4.728 ms
T1: 1.884 ms | T1.1: 1.219 ms | T1.2: 0.284 ms | T1.3: 0.379 ms | T2: 3.356 ms | T2.1: 0.565 ms | T2.2: 2.755 ms | T3: 5.241 ms

Hi,

Did you maximize the device performance first?
For benchmarking, it is recommended to lock the clock rates so that frequency scaling does not skew the results:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

By default, dynamic frequency scaling is applied, so an intensive job can cause the clock rates to rise.

Thanks.

Hi,

Maximizing the device performance did not change the behavior.
There is still a noticeable performance increase after createInferBuilder has been called.

Hi AastaLLL,

It would be rather unfortunate if that made a difference, I think.
Does TensorRT raise the clocks, or keep the Xavier's defaults?
I would like to use a lower power level, as I have a limited power budget.

Does this mean I have to initialize a builder in every program to increase performance?

Greetings, Nils

Hi AastaLLL,

do you have any news regarding this problem?

Hi,

Sorry for the late update.
We are checking this internally and will share more information later.

Thanks.

Hi,

Thanks for your patience.
We tried your sample with a 1150x1150 input but could not reproduce the issue.

Since TensorRT and OpenCV are separate libraries, they should not affect each other.
A possible explanation is related to memory allocation and access speed:
since TensorRT allocates a large amount of memory, subsequent allocations may become faster due to the OS's memory-management behavior.

Below are our testing scores for your reference.
You can see that the initial runs tend to be longer but improve after a few warm-up iterations.
After the warm-up, the performance is roughly the same between Test 1 and Test 2.

T1: 11.23 ms | T1.1: 6.827 ms | T1.2: 0.638 ms | T1.3: 3.76 ms | T2: 8.445 ms | T2.1: 3.011 ms | T2.2: 5.428 ms | T3: 19.676 ms
T1: 5.823 ms | T1.1: 3.647 ms | T1.2: 0.155 ms | T1.3: 2.018 ms | T2: 6.772 ms | T2.1: 3.552 ms | T2.2: 3.213 ms | T3: 12.596 ms
T1: 4.723 ms | T1.1: 3.615 ms | T1.2: 0.156 ms | T1.3: 0.949 ms | T2: 4.417 ms | T2.1: 1.803 ms | T2.2: 2.609 ms | T3: 9.141 ms
T1: 4.959 ms | T1.1: 2.741 ms | T1.2: 0.123 ms | T1.3: 2.093 ms | T2: 5.034 ms | T2.1: 1.764 ms | T2.2: 3.264 ms | T3: 9.994 ms
T1: 5.467 ms | T1.1: 3.553 ms | T1.2: 0.19 ms | T1.3: 1.721 ms | T2: 6.007 ms | T2.1: 1.771 ms | T2.2: 4.23 ms | T3: 11.475 ms
T1: 4.828 ms | T1.1: 1.865 ms | T1.2: 0.143 ms | T1.3: 2.817 ms | T2: 3.197 ms | T2.1: 0.499 ms | T2.2: 2.693 ms | T3: 8.026 ms
T1: 2.421 ms | T1.1: 1.884 ms | T1.2: 0.133 ms | T1.3: 0.402 ms | T2: 2.222 ms | T2.1: 0.618 ms | T2.2: 1.598 ms | T3: 4.644 ms
T1: 2.474 ms | T1.1: 1.939 ms | T1.2: 0.157 ms | T1.3: 0.376 ms | T2: 1.895 ms | T2.1: 0.55 ms | T2.2: 1.341 ms | T3: 4.371 ms
T1: 2.466 ms | T1.1: 1.941 ms | T1.2: 0.147 ms | T1.3: 0.376 ms | T2: 3.376 ms | T2.1: 0.59 ms | T2.2: 2.78 ms | T3: 5.843 ms
T1: 2.367 ms | T1.1: 1.839 ms | T1.2: 0.135 ms | T1.3: 0.39 ms | T2: 2.673 ms | T2.1: 0.487 ms | T2.2: 2.182 ms | T3: 5.041 ms
Test 1 DONE

T1: 4.246 ms | T1.1: 2.215 ms | T1.2: 1.011 ms | T1.3: 1.016 ms | T2: 7.066 ms | T2.1: 3.135 ms | T2.2: 3.924 ms | T3: 11.313 ms
T1: 3.462 ms | T1.1: 2.781 ms | T1.2: 0.176 ms | T1.3: 0.503 ms | T2: 4.508 ms | T2.1: 0.535 ms | T2.2: 3.968 ms | T3: 7.971 ms
T1: 2.592 ms | T1.1: 2.061 ms | T1.2: 0.172 ms | T1.3: 0.357 ms | T2: 2.742 ms | T2.1: 0.552 ms | T2.2: 2.184 ms | T3: 5.335 ms
T1: 2.465 ms | T1.1: 1.959 ms | T1.2: 0.146 ms | T1.3: 0.357 ms | T2: 4.397 ms | T2.1: 0.543 ms | T2.2: 3.849 ms | T3: 6.864 ms
T1: 2.267 ms | T1.1: 1.747 ms | T1.2: 0.143 ms | T1.3: 0.374 ms | T2: 2.646 ms | T2.1: 0.525 ms | T2.2: 2.115 ms | T3: 4.914 ms
T1: 2.383 ms | T1.1: 1.87 ms | T1.2: 0.14 ms | T1.3: 0.37 ms | T2: 4.448 ms | T2.1: 0.576 ms | T2.2: 3.866 ms | T3: 6.832 ms
T1: 2.543 ms | T1.1: 2.051 ms | T1.2: 0.132 ms | T1.3: 0.357 ms | T2: 2.584 ms | T2.1: 0.512 ms | T2.2: 2.067 ms | T3: 5.128 ms
T1: 2.385 ms | T1.1: 1.891 ms | T1.2: 0.14 ms | T1.3: 0.351 ms | T2: 2.773 ms | T2.1: 0.486 ms | T2.2: 2.284 ms | T3: 5.159 ms
T1: 2.396 ms | T1.1: 1.863 ms | T1.2: 0.179 ms | T1.3: 0.351 ms | T2: 3.13 ms | T2.1: 0.499 ms | T2.2: 2.626 ms | T3: 5.527 ms
T1: 2.435 ms | T1.1: 1.915 ms | T1.2: 0.152 ms | T1.3: 0.365 ms | T2: 1.84 ms | T2.1: 0.486 ms | T2.2: 1.348 ms | T3: 4.276 ms
Test 2 DONE

Thanks.