Many topics have reported high CPU usage caused by the synchronization functions (cudaDeviceSynchronize / cudaStreamSynchronize / cudaEventSynchronize). We built a profiling program from the TensorRT GoogleNet sample, replacing GoogleNet with SSD and its plugins, and observed that both the GPU and one CPU core run at 99% usage. The CPU usage only drops to 5% when we change the output tensors from [“mbox_loc”, “mbox_conf_flatten”] to [“conv9_2_mbox_conf”]. There is no data transfer between host and device during the execution epoch.
void timeInference(ICudaEngine* engine, int batchSize)
{
    IExecutionContext* context = engine->createExecutionContext();
    // zero the input buffer (buffer setup as in the sample is elided)
    CHECK(cudaMemset(buffers[inputIndex], 0, inputSize));
    for (int i = 0; i < TIMING_ITERATIONS; i++)
        context->execute(batchSize, buffers);   // synchronous execute; blocks the host thread
    // release the context and buffers
    context->destroy();
}
The output tensor sizes with a batch of 8 are [“mbox_loc” = 3,434,496 bytes, “mbox_conf_flatten” = 3,434,496 bytes] versus [“conv9_2_mbox_conf” = 9,216 bytes]. We also tested with the context->enqueue() function, as sketched below, and the result is the same.
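For reference, the asynchronous path we tested looks roughly like this (a minimal sketch; buffer setup is assumed to match the sample, and the stream is created locally for illustration):

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
for (int i = 0; i < TIMING_ITERATIONS; i++)
{
    context->enqueue(batchSize, buffers, stream, nullptr); // non-blocking launch
    CHECK(cudaStreamSynchronize(stream));                  // host waits here at ~99% CPU
}
CHECK(cudaStreamDestroy(stream));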
Is there any way to make the OS schedule the TensorRT host thread into the BLOCKED state so that other threads can use this CPU core? It is not reasonable for CPU core usage to stay high while the thread is mostly blocked. We profiled the code with NVVP, and it indicates the host thread is spending its time in cudaStreamSynchronize, shown as a brown bar in the timeline view.
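For clarity, the behavior we are hoping for is what cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) is documented to provide. A minimal sketch follows; we have not verified whether this actually lowers CPU usage under TensorRT, and the flag must be set before the CUDA context is created:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Ask the runtime to sleep the host thread on a synchronization primitive
    // instead of spin-waiting. Must run before any other runtime call that
    // creates the CUDA context, or it fails with cudaErrorSetOnActiveProcess.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess)
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
    // ... build the engine and run timeInference() as above ...
    return 0;
}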
Our platform is:
Tesla P4 + Driver 390.30 + CUDA 9.0 + cuDNN 7.1.1 + TensorRT 3.0.4 + Ubuntu 16.04.3 LTS
Please update sftechSSDProf.cpp:253, calibrator.create("/home/hello…"), with the proper int8_pruning0313.txt file name if anyone receives the profiling code internally.