Check if first enqueue() execution is completed

I’m working on a program that runs 3 neural networks, each one after the previous one has completed. Like this:

  1. Run NN1
  2. If NN1 found objects in the picture, run NN2
  3. If NN2 found objects in the picture, run NN3

So I need a way to check whether individual enqueue() calls have completed, to decide whether the next NN should be launched or whether it should wait for the next output from the previous NN.
And enqueue() takes only one cudaEvent_t argument, which is only there to check whether the input is ready to receive new data.
But I can’t find a way to check if enqueue() is completed.
Only cudaStreamSynchronize(), but it blocks until enqueue() is completed, which is unacceptable. And I’m not even sure whether it waits for the first enqueue() call to complete or for all of them.
So is there any way to check this?

Hi,

Can you share the pseudo code/sample code so that we can understand your requirement?

Thanks

Sample code — unfortunately no, I’m not allowed to share it (NDA).
So here’s pseudo code.
Note that in the final version of the code there could be further enqueue() calls before the previous ones have completed (in case NN2 or NN3 gets slowed down on a frame; the inference time of the same engine can vary by up to 10 ms).

class NeuralNetwork { ... };

void processNewFrame(cv::Mat &frame)
{
	if (NN3.executionCompleted())
	{
		auto result = NN3.getResult();
		// write result
	}
	
	if (NN2.executionCompleted())
	{
		// crop stored biggest detected object from frame:
		cv::Mat new_frame(NN2.used_frames().pop(), NN2.biggest_object_s_rectangle());
		NN3.enqueueNewFrame(new_frame);
	}
	
	if (NN1.executionCompleted())
	{
		cv::Mat new_frame(NN1.used_frame().pop(), NN1.biggest_object_s_rectangle());
		NN2.enqueueNewFrame(new_frame);
	}
	
	NN1.enqueueNewFrame(frame);
}

int main()
{
	// ...
	NeuralNetwork NN1, NN2, NN3;
	// open video stream
	// main loop:
	while (stream.isOpened())
	{
		// get next frame:
		stream >> frame;
		// run networks:
		processNewFrame(frame);
	}
}

The method executionCompleted() here should return true when the next (single) frame (not all frames in the queue) has been processed and its result is ready to use.
That’s what I’m looking for.
I found cudaStreamQuery(), but it returns cudaSuccess only when execution of all enqueue() calls on this CUDA stream has completed.
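
In other words, the closest I can get right now is a whole-stream check like this (a minimal sketch, assuming each NeuralNetwork owns its own cudaStream_t member fStream), which reports on all queued work, not on one particular enqueue():

// Non-blocking, but checks the WHOLE stream: cudaStreamQuery() returns
// cudaSuccess only once every operation queued on fStream (all pending
// enqueue() calls and copies) has finished, and cudaErrorNotReady otherwise.
bool NeuralNetwork::executionCompleted()
{
	return cudaStreamQuery(fStream) == cudaSuccess;
}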

A few things you can try:

  1. Use the same stream for all three networks. Taking a one-buffer example:

    cudaMemcpyAsync from host to device (buffer0, stream0);
    for (int batch = 0; batch < max; ++batch)
    {
        NN1->enqueue(buffer0, stream0);
        cudaMemcpyAsync from device to host to get object detection prob (stream0);
        cudaStreamSynchronize(stream0);
        if (object detection prob larger than threshold)
        {
            NN2->enqueue(buffer0, stream0);
            cudaMemcpyAsync from device to host to get object detection prob (stream0);
            cudaStreamSynchronize(stream0);
            if (object detection prob larger than threshold)
            {
                NN3->enqueue(buffer0, stream0);
                cudaMemcpyAsync from device to host to get object detection prob (stream0);
                cudaStreamSynchronize(stream0);
                Save results (CPU code);
            }
        }
    }

  2. You can also check CUDA events (see the sketch after this list):
    NVIDIA Deep Learning TensorRT Documentation

  3. Merge all 3 networks into a single network and generate a single TRT engine

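As a rough sketch of option 2 (illustrative only; the variable names below are placeholders, not from the documentation), the optional event passed to enqueue() tells you when the input buffer may be refilled:

    // Illustrative only: the event handed to enqueue() is signaled once the
    // input buffer is free to be reused; cudaEventSynchronize() waits for
    // exactly that, without waiting for the whole inference to finish.
    cudaEvent_t inputConsumed;
    cudaEventCreate(&inputConsumed);
    context->enqueue(batchSize, bindings, stream0, &inputConsumed);
    cudaEventSynchronize(inputConsumed);   // input buffer free, inference may still be running
    // refill the input buffer / prepare the next frame here
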
Thanks

  1. Use the same stream for all three networks. Taking a one-buffer example:

Won’t work.
First of all, they take different inputs:

  1. NN1 takes the whole frame and looks for vehicles
  2. NN2 takes the cropped biggest (and probably closest) vehicle detected by NN1 and looks for the license plate
  3. NN3 takes the cropped license plate and looks for symbols

Second, what you’re suggesting is to wait sequentially for all 3 NNs. That takes a lot of time (around 150 ms on a Jetson Nano), and you don’t need enqueue() for that.
What you wrote is just execute() calls, because in fact enqueue() + cudaStreamSynchronize() does exactly that: it waits until the inference launched with enqueue() is completed:

Using the cudaStreamSynchronize function after calling launchInference ensures GPU computations complete before the results are accessed

It will take (3 x single_inference_time):

time →


[NN1.enqueue][sync][NN2.enqueue][sync][NN3.enqueue][sync]

And what I’m trying to do should take about 1.1-1.2x the time of a single run, not 3x:

time →


[NN3.enqueue() execution]
__[NN2.enqueue() execution]
____[NN1.enqueue() execution]

Like this ↑.
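
In code the overlap I’m after would look roughly like this (just a sketch; each network owns its own CUDA stream, and the cropped inputs are placeholder names):

// Each NN launches on its own stream, so nothing below blocks the host and
// the GPU can overlap the three inferences:
NN3.enqueueNewFrame(plateCrop);    // symbols (input prepared from an earlier frame)
NN2.enqueueNewFrame(vehicleCrop);  // license plate
NN1.enqueueNewFrame(frame);        // vehicles
// total time ≈ 1.1-1.2x of a single inference, not the sum of all three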

2. You can also check CUDA events

The only built-in event is good only for feeding in data as fast as you can and then waiting until all enqueue() calls are completed:

TensorRT also includes an optional CUDA event in the method IExecutionContext::enqueue that will be signaled once the input buffers are free to be reused.

But it is completely useless if you need to find out when the inference of a particular enqueue() call has finished.
And as I wrote earlier, cudaStreamQuery() and cudaStreamSynchronize() wait until all enqueue() calls are completed, not individual ones.
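
What would actually help is a per-call marker. The only thing I can think of is recording my own event right after each enqueue() (a sketch using the members of my NN class further down; inferenceDone is my own event, not something TensorRT provides, and I haven’t verified this in my pipeline):

// Sketch: record my own event right after enqueue(); cudaEventQuery() on it
// returns cudaSuccess once everything queued on fStream up to this point
// (i.e. this particular inference) has finished, and cudaErrorNotReady
// otherwise, without blocking the host.
cudaEvent_t inferenceDone;
cudaEventCreateWithFlags(&inferenceDone, cudaEventDisableTiming);

fContext->enqueue(1, bindings, fStream, &inputIsFree);
cudaEventRecord(inferenceDone, fStream);

// later, in the host loop:
bool thisCallCompleted = (cudaEventQuery(inferenceDone) == cudaSuccess);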

  3. Merge all 3 networks into a single network and generate a single TRT engine

Not possible, it would take too much time (see #1).

So I’ve decided to keep only 1 inference in the CUDA stream queue. It should still be faster than 3 execute() calls in a row. Now I’m using this code (note that the idea remains the same: prepare and launch NN3’s enqueue(); while it runs, launch NN2, and so on):

void ProcessCurrentFrame()
{
	double time_span;
	auto start = std::chrono::high_resolution_clock::now();

	if (NN3->waitForResult()) // returns true when done
	{
		if (NN3->skipNext) // no license plate picture from NN2 so no frame to look for symbols
			NN3->skipNext = false; // turn off for next run
		else
			NN3->copyOutputBuffers(); // only copying, processing in the end to not slow down other networks
	}
	
	if (NN2->waitForResult())
	{
		if (NN2->skipNext) // no objects from NN1
			NN2->skipNext = false;
		else
		{
			NN2->copyOutputBuffers();
			// process output immediately to use the result in NN3
			auto result = NN2->getResult();
			// if at least one object detected
			if (result.size() > 0)
			{
				// select biggest (= closest) one
				cv::Rect2f biggestObject = getBiggestObject(result);
				// offset from cropped object (result of NN1) coordinates to original frame coordinates
				biggestObject += cv::Point2f(
					NN2->used_rects.front().x,
					NN2->used_rects.front().y
				);
				// void sendFrameToNN(cv::Mat frame, cv::Rect2f roi)
				NN3->sendFrameToNN( // look for symbols
					NN2->used_frames.front(), // stored original frame
					biggestObject // ROI of the biggest detected object
				);
			}
			else // no objects detected
				NN3->skipNext = true;
		}
		NN2->used_frames.pop();
		NN2->used_rects.pop();
	}
	
	// basically same as for NN2:
	if (NN1->waitForResult())
	{
		NN1->copyOutputBuffers();
		auto result = NN1->getResult();
		if (result.size() > 0)
		{
			cv::Rect2f biggestObject = getBiggestObject(result);
			NN2->sendFrameToNN( // look for license plate
				NN1->used_frames.front(),
				biggestObject
			);
		}
		else
			NN2->skipNext = true;
	}
	
	// look for vehicles
	NN1->sendFrameToNN(currentFrame);
	
	std::vector<DetectedObject> finalObjects = NN3->getResult();
	
	// do things with result
	// ...
	
	time_span = std::chrono::duration_cast<std::chrono::duration<double>>(std::chrono::high_resolution_clock::now() - start).count();
	avg = (avg * internal_counter + time_span) / (internal_counter + 1);
	++internal_counter;
	std::cout << time_span << " secs, " << avg << " secs avg\n";
}

And I’m getting the same time as when I’m using execute().
I’m creating the CUDA stream with flags, copying the data and calling inference, all using async methods:

class NN
{
	// ...
	cudaStream_t fStream;
	nvinfer1::IExecutionContext* fContext;
	float** fOutBuffers; // store output of network;
	void* bindings[3]; // pointers on GPU memory
	nvinfer1::Dims fInputDims; // network input dimensions
	nvinfer1::Dims fOutputDims[2]; // network output dimensions, 2 layers
	cudaEvent_t inputIsFree; // signaled by enqueue() once the input buffer can be reused
	
	NN()
	{
		// ...
		CHECK(cudaMalloc(&bindings[0], fInputDims.d[0] * fInputDims.d[1] * fInputDims.d[2] * sizeof(float)));
		CHECK(cudaMalloc(&bindings[1], fOutputDims[0].d[0] * fOutputDims[0].d[1] * fOutputDims[0].d[2] * sizeof(float)));
		CHECK(cudaMalloc(&bindings[2], fOutputDims[1].d[0] * fOutputDims[1].d[1] * fOutputDims[1].d[2] * sizeof(float)));
		cudaEventCreate(&inputIsFree);
		cudaStreamCreateWithFlags(&fStream, cudaStreamNonBlocking);
	}

	void sendFrameToNN(cv::Mat &frame, cv::Rect2f &rect)
	{
		auto blob = convertImageToBlob(
			cv::Mat(frame.clone(), rect),
			fInputDims.d[1],
			fInputDims.d[2],
			false
		);
		CHECK(cudaMemcpyAsync(bindings[0], (void*)(blob.data()), fInputDims.d[0] * fInputDims.d[1] * fInputDims.d[2] * sizeof(float), cudaMemcpyHostToDevice, fStream));
		fContext->enqueue(1, bindings, fStream, &inputIsFree);
		// ...
	}

	bool waitForResult()
	{
		cudaStreamSynchronize(fStream);
		return true;
	}

	void copyOutputBuffers()
	{
		CHECK(cudaMemcpyAsync(fOutBuffers[0], bindings[1], fOutputDims[0].d[0] * fOutputDims[0].d[1] * fOutputDims[0].d[2] * sizeof(float), cudaMemcpyDeviceToHost, fStream));
		CHECK(cudaMemcpyAsync(fOutBuffers[1], bindings[2], fOutputDims[1].d[0] * fOutputDims[1].d[1] * fOutputDims[1].d[2] * sizeof(float), cudaMemcpyDeviceToHost, fStream));
	}
	// ...
};

And in NVVP it looks like all 3 NNs use the same stream (№14, to be exact).
Changing all this code to the sync variants only leads to using the default CUDA stream instead of stream 14.

If the input of NN2 does not depend on NN1’s output, then you can just enqueue NN2 on a different CUDA stream before NN1 has finished.

And just abandon the result of NN2 if NN1 didn’t find objects. It’s like manual branch prediction in a modern CPU.
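
A rough sketch of that idea (pseudo code, names are placeholders):

    NN1->enqueue(buffer1, stream1);          // vehicle detection
    NN2->enqueue(buffer2, stream2);          // speculative launch, input is a guess
    cudaStreamSynchronize(stream1);          // wait only for NN1
    if (NN1 found no objects)
        discard NN2 output;                  // like a mispredicted branch
    else
        cudaStreamSynchronize(stream2);      // keep and use NN2 result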

Thanks

NN2 does depend on NN1’s result, but yes, it really is similar to branch prediction.

Now I’ve found one really weird thing: enqueue() + cudaStreamSynchronize(fStream) take a lot of time, roughly 72 ms in total.

While execute() takes only 55 ms.

And cudaStreamSynchronize() takes ~35 ms even after execute() (i.e. there is no work left in the stream, so it shouldn’t take that long).
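
This is roughly how such a split can be measured (a minimal sketch, not my exact timing code; the real code also does the async output copies in between):

auto t0 = std::chrono::high_resolution_clock::now();
fContext->enqueue(1, bindings, fStream, &inputIsFree);   // launch only
auto t1 = std::chrono::high_resolution_clock::now();
cudaStreamSynchronize(fStream);                          // wait for the GPU to finish
auto t2 = std::chrono::high_resolution_clock::now();

std::cout << "enqueue: " << std::chrono::duration<double, std::milli>(t1 - t0).count()
          << " ms, sync: " << std::chrono::duration<double, std::milli>(t2 - t1).count()
          << " ms\n";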

Can you explain why this is happening?

Hi,

Can you share sample code to reproduce the issue so we can better help?
If possible, please share the profiler output as well.

Thanks