TensorRT affects memory access speed

While working on a YOLOv4 implementation using TensorRT, I noticed that the per-iteration runtime (image preprocessing, inference, and image post-processing) differs between a run where the engine is built from the ONNX model and a run where a serialized engine is loaded. When the engine is loaded, iterations are slower than when the engine is built from scratch. Comparing the runtimes of the individual steps showed that the preprocessing step is 2 to 3 times slower when the engine is loaded. After more testing, I found that calling nvinfer1::createInferBuilder reduces the preprocessing runtime by that same factor of 2 to 3. It appears that createInferBuilder performs some action that affects the way memory is accessed and thereby reduces the runtime.

I have included the test application I wrote to demonstrate the described behavior, along with the output of the two tests. The first test executes 10 iterations of the image preprocessing step before createInferBuilder is called; the second test executes them after it has been called.

Can someone explain this behavior, and describe the steps required to achieve the same speedup without calling createInferBuilder?
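My current guess, which I have not been able to confirm, is that createInferBuilder initializes the CUDA context on the device, and that this initialization is what changes the memory behavior. If that is the case, a plain CUDA runtime call that forces context creation should have the same effect. A minimal sketch of that idea (this is an assumption on my part, not a confirmed fix):

```cuda
#include <cuda_runtime.h>
#include <iostream>

int main()
{
	// cudaFree(0) is a common idiom to force lazy CUDA context creation
	// without allocating anything; cudaSetDevice selects the GPU first.
	cudaError_t err = cudaSetDevice(0);
	if (err == cudaSuccess)
		err = cudaFree(0);
	if (err != cudaSuccess)
		std::cerr << "CUDA init failed: " << cudaGetErrorString(err) << std::endl;

	// ... run the preprocessing benchmark here and compare the timings ...
	return 0;
}
```

If someone can confirm whether the context initialization is really what matters here, that would answer my question.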

Test Application
#include <chrono>
#include <iostream>
#include <vector>

#include <NvInfer.h>
#include <opencv2/opencv.hpp>

class Logger : public nvinfer1::ILogger
{
	void log(Severity severity, const char *msg) override
	{
		// suppress info and verbose messages
		if (severity != Severity::kINFO && severity != Severity::kVERBOSE)
			std::cout << msg << std::endl;
	}
} gLogger;

class Timer
{
	using Clock = std::chrono::high_resolution_clock;

public:
	void Start()
	{
		m_stopped   = false;
		m_StartTime = Clock::now();
	}

	void Stop()
	{
		m_stopped = true;
		m_EndTime = Clock::now();
	}

	double GetElapsedTimeInMilliSec() const
	{
		return std::chrono::duration_cast<std::chrono::microseconds>(getTimeDiff()).count() * 1.0e-3;
	}

	friend std::ostream &operator<<(std::ostream &stream, const Timer &t)
	{
		stream << t.GetElapsedTimeInMilliSec() << " ms";
		return stream;
	}

private:
	Clock::duration getTimeDiff() const
	{
		if (!m_stopped)
			return (Clock::now() - m_StartTime);
		else
			return (m_EndTime - m_StartTime);
	}

private:
	bool m_stopped;

	Clock::time_point m_StartTime;
	Clock::time_point m_EndTime;
};

std::vector<float> test(const cv::Mat &image)
{
	Timer t1;
	Timer t11;
	Timer t12;
	Timer t13;
	Timer t2;
	Timer t21;
	Timer t22;
	Timer t3;

	t3.Start();
	t1.Start();
	// ==== Pre-Process Image ====
	std::vector<float> data;
	cv::Mat scaledImg;
	cv::Mat imgConv;
	cv::Mat imgConv2;
	t11.Start();
	cv::resize(image, scaledImg, cv::Size(608, 608));
	t11.Stop();

	t12.Start();
	cv::cvtColor(scaledImg, imgConv, cv::COLOR_BGR2RGB);
	t12.Stop();
	t13.Start();
	imgConv.convertTo(imgConv2, CV_32FC3, 1 / 255.0);
	t13.Stop();

	t1.Stop();

	t2.Start();
	std::vector<cv::Mat> channels(3);
	t21.Start();
	cv::split(imgConv2, channels);
	t21.Stop();
	float *ptr1 = (float *)(channels[0].data);
	float *ptr2 = (float *)(channels[1].data);
	float *ptr3 = (float *)(channels[2].data);
	t22.Start();
	data.insert(data.end(), ptr1, ptr1 + 608 * 608);
	data.insert(data.end(), ptr2, ptr2 + 608 * 608);
	data.insert(data.end(), ptr3, ptr3 + 608 * 608);
	t22.Stop();
	t22.Stop();

	t2.Stop();
	t3.Stop();
	// ==== Pre-Process Image ====

	std::cout << "T1: " << t1 << " | T1.1: " << t11 << " | T1.2: " << t12 << " | T1.3: " << t13 << " | T2: " << t2 << " | T2.1: " << t21 << " | T2.2: " << t22 << " | T3: " << t3 << std::endl;

	return data;
}

#define ITERATIONS 10

int main(int argc, char **argv)
{
	std::vector<std::vector<float>> dd;
	cv::Mat img = cv::imread("dog.jpg");

	for (uint32_t i = 0; i < ITERATIONS; i++)
		dd.push_back(test(img));

	std::cout << "Test 1 DONE" << std::endl;
	getchar();

	nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(gLogger);
	for (uint32_t i = 0; i < ITERATIONS; i++)
		dd.push_back(test(img));

	std::cout << "Test 2 DONE" << std::endl;
	getchar();

	// Prevent the compiler from optimizing out the results
	for (const std::vector<float> &a : dd)
	{
		for (const float &elem : a)
			std::cout << elem << std::endl;
	}

	return 0;
}
Test 1 Output

T1: 3.241 ms | T1.1: 1.837 ms | T1.2: 0.569 ms | T1.3: 0.834 ms | T2: 8.18 ms | T2.1: 4.233 ms | T2.2: 3.941 ms | T3: 11.422 ms
T1: 3.978 ms | T1.1: 1.158 ms | T1.2: 0.443 ms | T1.3: 2.374 ms | T2: 7.918 ms | T2.1: 4.152 ms | T2.2: 3.764 ms | T3: 11.897 ms
T1: 1.268 ms | T1.1: 0.741 ms | T1.2: 0.122 ms | T1.3: 0.403 ms | T2: 7.658 ms | T2.1: 4.04 ms | T2.2: 3.615 ms | T3: 8.927 ms
T1: 1.296 ms | T1.1: 0.777 ms | T1.2: 0.12 ms | T1.3: 0.398 ms | T2: 9.16 ms | T2.1: 4.045 ms | T2.2: 5.112 ms | T3: 10.458 ms
T1: 1.195 ms | T1.1: 0.697 ms | T1.2: 0.114 ms | T1.3: 0.381 ms | T2: 8.741 ms | T2.1: 4.056 ms | T2.2: 4.682 ms | T3: 9.937 ms
T1: 1.485 ms | T1.1: 0.847 ms | T1.2: 0.254 ms | T1.3: 0.383 ms | T2: 8.491 ms | T2.1: 4.094 ms | T2.2: 4.394 ms | T3: 9.977 ms
T1: 1.324 ms | T1.1: 0.743 ms | T1.2: 0.156 ms | T1.3: 0.423 ms | T2: 8.116 ms | T2.1: 4.022 ms | T2.2: 4.091 ms | T3: 9.441 ms
T1: 1.316 ms | T1.1: 0.736 ms | T1.2: 0.196 ms | T1.3: 0.382 ms | T2: 7.859 ms | T2.1: 4.053 ms | T2.2: 3.803 ms | T3: 9.177 ms
T1: 1.222 ms | T1.1: 0.614 ms | T1.2: 0.204 ms | T1.3: 0.401 ms | T2: 7.55 ms | T2.1: 4.055 ms | T2.2: 3.493 ms | T3: 8.773 ms
T1: 1.201 ms | T1.1: 0.679 ms | T1.2: 0.139 ms | T1.3: 0.382 ms | T2: 9.291 ms | T2.1: 4.11 ms | T2.2: 5.179 ms | T3: 10.493 ms

Test 2 Output

T1: 7.958 ms | T1.1: 3.812 ms | T1.2: 1.927 ms | T1.3: 2.216 ms | T2: 8.908 ms | T2.1: 3.849 ms | T2.2: 5.055 ms | T3: 16.867 ms
T1: 2.599 ms | T1.1: 1.809 ms | T1.2: 0.332 ms | T1.3: 0.455 ms | T2: 2.536 ms | T2.1: 0.538 ms | T2.2: 1.995 ms | T3: 5.135 ms
T1: 1.832 ms | T1.1: 1.032 ms | T1.2: 0.373 ms | T1.3: 0.425 ms | T2: 2.789 ms | T2.1: 0.513 ms | T2.2: 2.274 ms | T3: 4.623 ms
T1: 1.512 ms | T1.1: 0.79 ms | T1.2: 0.319 ms | T1.3: 0.4 ms | T2: 3.012 ms | T2.1: 0.604 ms | T2.2: 2.406 ms | T3: 4.525 ms
T1: 1.812 ms | T1.1: 0.891 ms | T1.2: 0.472 ms | T1.3: 0.447 ms | T2: 1.903 ms | T2.1: 0.589 ms | T2.2: 1.311 ms | T3: 3.716 ms
T1: 2.059 ms | T1.1: 1.404 ms | T1.2: 0.263 ms | T1.3: 0.39 ms | T2: 2.014 ms | T2.1: 0.548 ms | T2.2: 1.463 ms | T3: 4.074 ms
T1: 1.935 ms | T1.1: 1.165 ms | T1.2: 0.378 ms | T1.3: 0.39 ms | T2: 2.283 ms | T2.1: 0.546 ms | T2.2: 1.735 ms | T3: 4.22 ms
T1: 2.448 ms | T1.1: 0.874 ms | T1.2: 1.165 ms | T1.3: 0.406 ms | T2: 2.614 ms | T2.1: 0.544 ms | T2.2: 2.066 ms | T3: 5.062 ms
T1: 1.824 ms | T1.1: 1.067 ms | T1.2: 0.377 ms | T1.3: 0.378 ms | T2: 2.903 ms | T2.1: 0.596 ms | T2.2: 2.305 ms | T3: 4.728 ms
T1: 1.884 ms | T1.1: 1.219 ms | T1.2: 0.284 ms | T1.3: 0.379 ms | T2: 3.356 ms | T2.1: 0.565 ms | T2.2: 2.755 ms | T3: 5.241 ms

Hi,

Did you maximize the device performance first?
For benchmarking, it’s recommended to fix the clock rates so that dynamic frequency scaling does not skew the results.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

By default, the dynamic clock mode is applied, so it is possible that some intensive job causes the clock rates to rise.
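For reference, the result can be verified afterwards; as far as I know, jetson_clocks has a --show option that prints the current clock configuration:

```shell
# Lock the clocks, then print the resulting configuration to confirm.
sudo nvpmodel -m 0          # select the maximum-performance power model
sudo jetson_clocks          # pin CPU/GPU/EMC clocks to their maximums
sudo jetson_clocks --show   # display the current clock settings
```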

Thanks.

Hi,

Maximizing the device performance did not change the behavior.
There is still a noticeable performance increase after createInferBuilder has been called.

Hi AastaLLL,

I think it would be quite unfortunate if that made a difference.
Does TensorRT raise the clock rates of the Xavier, or does it leave them unchanged?
I would like to use a lower power mode, as I have a limited power budget.

Does this mean that I have to initialize a builder in every program to get this performance increase?

Greetings, Nils