TensorRT consumes 100% of a CPU core with 6 MiB of output data.

Hi,

We are still optimizing our SSD network, and it is very fast now, but we have run into a CPU core bottleneck: one core is running at 100% usage. We don’t know why this happens, as there is no data copy to host memory; we only marked mbox_conf and mbox_conf_flat as the outputs in TensorRT.

When we profile manually with only conv9 as the output, the CPU core usage is low (7%).

Is there any profiling tool other than nvprof?

Thanks.

Hi,

You can open an nvprof output file with NVVP to get more information.

The CPU Details view of NVVP shows the amount of time your application spends executing functions on the CPU.
Profiler :: CUDA Toolkit Documentation
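For example, you can export a timeline that NVVP can open, or sample CPU hotspots directly (the application name below is a placeholder):

nvprof --export-profile ssd_prof.nvprof ./ssd_prof     # open ssd_prof.nvprof in NVVP
nvprof --cpu-profiling on ./ssd_prof                   # text summary of CPU hotspots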

Thanks.

Hi,

Many forum topics have noted high CPU usage caused by the synchronize functions (cudaDeviceSynchronize / cudaStreamSynchronize / cudaEventSynchronize). We created a profiling program from the TensorRT GoogleNet sample, simply replacing GoogleNet with SSD and our plugins, and we observed both the GPU and one CPU core running at 99% usage. The CPU usage only drops to 5% when I change the output tensors from [“mbox_loc”, “mbox_conf_flatten”] to [“conv9_2_mbox_conf”]. There is no data transfer between host and device during the execute loop.
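For reference, this is roughly how the outputs are marked in the profiling program (a minimal sketch of the TensorRT 3 Caffe-parser flow; the file names and gLogger are placeholders taken from the samples):

IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
const IBlobNameToTensor* blobNameToTensor =
    parser->parse("ssd_deploy.prototxt", "ssd.caffemodel", *network, DataType::kFLOAT);

// marking different blobs as outputs is what changes the CPU usage we observe
network->markOutput(*blobNameToTensor->find("mbox_loc"));
network->markOutput(*blobNameToTensor->find("mbox_conf_flatten"));
// vs. only: network->markOutput(*blobNameToTensor->find("conv9_2_mbox_conf"));

The timing loop of the program is below: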

void timeInference(ICudaEngine* engine, int batchSize)
{
    ...
    // zero the input buffer
    CHECK(cudaMemset(buffers[inputIndex], 0, inputSize));

    // synchronous execution; no host<->device copies inside the timing loop
    for (int i = 0; i < TIMING_ITERATIONS; i++)
        context->execute(batchSize, buffers);

    // release the context and buffers
    context->destroy();
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

With a batch size of 8, the output tensor sizes are: “mbox_loc” = 3,434,496 bytes, “mbox_conf_flatten” = 3,434,496 bytes, and “conv9_2_mbox_conf” = 9,216 bytes. We also tested with the context->enqueue() function, and the result is the same.
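For reference, the asynchronous variant we tested looks roughly like this (a sketch; the stream setup is illustrative):

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

for (int i = 0; i < TIMING_ITERATIONS; i++)
    context->enqueue(batchSize, buffers, stream, nullptr);

// NVVP shows the host thread spinning in this synchronize
CHECK(cudaStreamSynchronize(stream));
CHECK(cudaStreamDestroy(stream));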

Is there any way to make the OS schedule the TensorRT host thread into the BLOCKED state so that other threads can use this CPU core? It is not reasonable for CPU usage to stay high while the thread is mostly blocked. We profiled the code with NVVP, and it indicates the host thread is sitting in cudaStreamSynchronize, shown as a brown bar in the timeline view.
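One thing we plan to try (not yet verified on our side) is asking the CUDA runtime for blocking synchronization before any context is created, so that synchronize calls yield the core instead of spinning:

// must run before the first CUDA call that creates the context
// (assumption: the spin comes from the default polling behavior, cudaDeviceScheduleAuto)
CHECK(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync));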
Our platform is:
Tesla P4 + Driver 390.30 + CUDA 9.0 + cuDNN 7.1.1 + TensorRT 3.0.4 + Ubuntu 16.04.3 LTS

If anyone receives the profiling code internally, please update sftechSSDProf.cpp:253, calibrator.create(“/home/hello…”), with the proper int8_pruning0313.txt file name.

Thanks.

Hi,

We reviewed our code and removed some cudaStreamSynchronize calls from the plugin code; the usage of the one busy CPU core is now slightly lower (85%).
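The pattern we moved to is simply launching on the stream TensorRT passes in and returning without synchronizing (a minimal sketch of a TensorRT 3 IPlugin::enqueue; myKernel and the launch configuration are illustrative):

int enqueue(int batchSize, const void* const* inputs, void** outputs,
            void* workspace, cudaStream_t stream) override
{
    const int blocks = 128, threads = 256;   // launch configuration is illustrative
    myKernel<<<blocks, threads, 0, stream>>>(
        static_cast<const float*>(inputs[0]),
        static_cast<float*>(outputs[0]), batchSize);
    return 0;   // no cudaStreamSynchronize here; TensorRT owns the stream
}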

Thanks.

Hi,

Could you share more information about the implementation of the custom functions?
Is it a CUDA implementation or CPU-based code?

Also, could you help check whether this issue also occurs without the plugin functions?

Thanks.

Hi,

NeoSong has already sent you the SSD low-rank plugin implementation. I observe the same CPU usage issue in the same scenario, with a slightly different network but the same plugin functions.

The plugin is implemented in CUDA, and we test it with a tiny program. Maybe we can send you another code base that shows the CPU usage issue.

I will check the effect of removing the plugins one by one. We already know that if the output is [“mbox_loc”] only, CPU usage goes down to 75%. I think TensorRT disables the calculation of layers that are not related to the outputs. Is that right?

Thanks.

Hi,

Yes, when optimizing, TensorRT does apply sorting to find the essential path, so layers that do not contribute to a marked output are skipped.

For a detection use case, could you share some information about how the bounding-box merging/grouping is implemented?
Maybe this operation is the source of the high CPU utilization.
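For example, a CPU-side non-maximum suppression loop like the following sketch (types and threshold are illustrative) is O(n²) over the candidate boxes and can keep one core busy by itself:

#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// intersection-over-union of two boxes
static float iou(const Box& a, const Box& b)
{
    float ix = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float iy = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = ix * iy;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// greedy NMS: keep the highest-scoring box, drop boxes overlapping it
std::vector<Box> nms(std::vector<Box> boxes, float thresh)
{
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        bool keep = true;
        for (const Box& k : kept)
            if (iou(b, k) > thresh) { keep = false; break; }
        if (keep) kept.push_back(b);
    }
    return kept;
}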

Thanks.

Hi,

I have updated the tiny program and done a little more testing. The CPU usage with different output settings is:

[mbox_loc]: very low
[mbox_loc, mbox_conf_flatten]: 80%
[mbox_conf_flatten]: 20%

It’s very interesting, and it is hard to tell which output group consumes the CPU core, and why. The tiny program never uses the TensorRT output data. I think it is worth using NVVP again.

I will share more information if we need more help from you.

Thanks.