Why is time divided by iterations in sampleGoogleNet.cpp of the TensorRT 3.0 SDK?

We want to profile another TensorRT instance, so we cloned the Profiler C++ class from the sampleGoogleNet.cpp file. Its reported times looked strange, so I dug into the documentation and the header file. It seems the Profiler class should either report time in seconds instead of milliseconds, or simply not divide the time by TIMING_ITERATIONS (1000).

struct Profiler : public IProfiler
{
	typedef std::pair<std::string, float> Record;
	std::vector<Record> mProfile;

	virtual void reportLayerTime(const char* layerName, float ms)
	{
		auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });
		if (record == mProfile.end())
			mProfile.push_back(std::make_pair(layerName, ms));
		else
			record->second += ms;
	}

	void printLayerTimes()
	{
		float totalTime = 0;
		for (size_t i = 0; i < mProfile.size(); i++)
		{
			// mProfile[i].second is already in milliseconds (per the docs), so why divide it by TIMING_ITERATIONS?
			printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second / TIMING_ITERATIONS);
			totalTime += mProfile[i].second;
		}
		printf("Time over all layers: %4.3f\n", totalTime / TIMING_ITERATIONS);
	}
} gProfiler;

This is the reportLayerTime function declaration in NvInfer.h:

virtual void nvinfer1::IProfiler::reportLayerTime(const char* layerName, float ms)  [pure virtual]

Layer time reporting callback.

Parameters:
	layerName	the name of the layer, set when constructing the network definition
	ms		the time in milliseconds to execute the layer

What’s wrong?


As you said, in

virtual void nvinfer1::IProfiler::reportLayerTime(const char* layerName, float ms)

ms is the time in milliseconds.

In the timeInference() function, we launch TensorRT TIMING_ITERATIONS=1000 times, and reportLayerTime() accumulates the execution time of each layer across all iterations.
When printing, we need to divide the accumulated time by TIMING_ITERATIONS=1000 to get the average execution time per iteration.


Hmm, so TensorRT really is fast. I just noticed my title is wrong; the 1000 iterations made me misunderstand. Thanks for your reply.

sampleGoogleNet.cpp uses a batch size of 4, but the Profiler doesn't divide the time by the batch size. Shouldn't it actually be faster for a single input?


The Profiler measures the execution time of one context->execute() call.
For a larger batch size, the execution time of that call is longer.
Ordered from slowest to fastest: Batch=1 x 4 (four separate calls) > Batch=4 > Batch=1.

By default, we use batch size=4.
You can modify it to batch size=1, or divide the execution time by 4, to get the per-image profiling data you prefer.

Usually, launching the engine once with batch=4 is faster than launching it with batch=1 four times.
Whether that helps depends on whether you can collect four inputs and run inference on them at the same time (e.g. four stream inputs).


That's good news; we can profile our program more precisely now, and the "strangeness" is gone. Thanks.