The TensorRT inference API consumes significant CPU resources (Jetson Xavier NX)

The TensorRT inference API consumes 15-20% of CPU resources on a single core.

Only the TensorRT inference call itself was executed in the test; CPU/GPU data transfers were not included.

In theory it shouldn't take up CPU resources, or at least not this much. Is this normal? If not, how should I fix or optimize it?

Environment info:
TensorRT 8.5.2.2
cuDNN 8.6.0.166
CUDA 11.4
Ubuntu 20.04
Jetson Xavier NX
AI model: yolov8n-pose.engine

Hi,

Could you share the complete source that was used for testing CPU utilization?

Although enqueue is an asynchronous call, the synchronization that follows it forces the CPU to wait for the GPU task to finish, and this waiting takes CPU resources.
To share further suggestions/comments, we will need more details about your source.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-853/api/c_api/classnvinfer1_1_1_i_execution_context.html#a04689994873d4f788d35f6ec6ab247bf
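
As an aside (this is an illustrative sketch, not necessarily the fix in your case), the CPU spin during such a synchronization can sometimes be reduced by asking the CUDA runtime for blocking synchronization. The helper name below is ours, and it assumes the flag is set before any other CUDA call in the process:

#include <cuda_runtime.h>

// Sketch: request that the host thread block (yield the CPU) instead of
// spin-waiting when it synchronizes with the GPU. Must run before the CUDA
// context is created, i.e. before any other CUDA API call. Error checking omitted.
void initCudaForLowCpuSync()
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
}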

Thanks.

Thank you. Here is the detailed code; please have a look.

std::vector<DetectResult> YoloDetecter::inference(cv::Mat& img)
{
    auto t1 = std::chrono::system_clock::now();
    preprocess2gpu(img, vBufferD[0], kInputH, kInputW, stream); // inference preprocessing
    auto t2 = std::chrono::system_clock::now();
    inference(); // infer
    auto t3 = std::chrono::system_clock::now();

    std::vector<Detection> res;
    nms(res, outputData, kConfThresh, kNmsThresh); //infer postprocessing
    auto t4 = std::chrono::system_clock::now();

    std::vector<DetectResult> final_res;
    for (size_t j = 0; j < res.size(); j++)
    {
        cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);
        DetectResult single_res;
        single_res.tlwh = r;
        memcpy(single_res.keypoints, res[j].keypoints, sizeof(float) * kNumberOfPoints * 3);
        single_res.conf = res[j].conf;
        single_res.class_id = (int)res[j].class_id;
        final_res.push_back(single_res);
    }
    auto t5 = std::chrono::system_clock::now();

    std::cout << "#TRT2: preprocess time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << "ms" << std::endl;
    std::cout << "#TRT2: inference time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count() << "ms" << std::endl;
    std::cout << "#TRT2: nms time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count() << "ms" << std::endl;
    std::cout << "#TRT2: final_res time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t5 - t4).count() << "ms" << std::endl;
    std::cout << "#TRT2: all time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t5 - t1).count() << "ms" << std::endl;
    std::cout << std::endl;

    return final_res;
}

void preprocess2gpu(const cv::Mat& srcImg, float* dstData, const int dstHeight, const int dstWidth, const cudaStream_t& preprocess_s)
{
    int srcHeight = srcImg.rows;
    int srcWidth = srcImg.cols;
    int srcElements = srcHeight * srcWidth * 3;
    int dstElements = dstHeight * dstWidth * 3;

    cudaMemcpy(srcDevData, srcImg.data, sizeof(uchar) * srcElements, cudaMemcpyHostToDevice); // synchronous (blocking) host-to-device copy of the source image

    // calculate width and height after resize
    int w, h, x, y;
    float r_w = dstWidth / (srcWidth * 1.0);
    float r_h = dstHeight / (srcHeight * 1.0);
    if (r_h > r_w) {
        w = dstWidth;
        h = r_w * srcHeight;
        x = 0;
        y = (dstHeight - h) / 2;
    }
    else {
        w = r_h * srcWidth;
        h = dstHeight;
        x = (dstWidth - w) / 2;
        y = 0;
    }

    dim3 blockSize(32, 32);
    dim3 gridSize((dstWidth + blockSize.x - 1) / blockSize.x, (dstHeight + blockSize.y - 1) / blockSize.y);

    // letterbox and resize
    letterbox<<<gridSize, blockSize, 0, preprocess_s>>>(srcDevData, srcHeight, srcWidth, midDevData, dstHeight, dstWidth, h, w, y, x);
    process<<<gridSize, blockSize>>>(midDevData, dstData, dstHeight, dstWidth);
}

void YoloDetecter::inference()
{
    context->enqueue(1, (void**)vBufferD.data(), stream, nullptr); // asynchronous inference launch on the stream
    CUDA_CHECK(cudaMemcpyAsync((void *)outputData, vBufferD[1], vTensorSize[1], cudaMemcpyDeviceToHost, stream)); // async copy of the output back to the host
    CUDA_CHECK(cudaStreamSynchronize(stream)); // blocks the calling CPU thread until all work on the stream finishes
}

Hi,

It looks like there are two inference functions in your source.
Which one do you use for benchmarking?

Is it possible that the CPU resources are occupied by the nms(...) call in the top inference function?
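
One way to check (a sketch; the helper below is ours and assumes the stages from your posted code) is to compare CPU time with wall-clock time around each stage:

#include <chrono>
#include <ctime>
#include <functional>
#include <iostream>

// Sketch: std::clock() reports the CPU time used by the process, while
// steady_clock reports elapsed wall time. A stage that spin-waits on the GPU
// shows CPU time close to its wall time; a stage that truly sleeps shows much less.
void profileStage(const char* name, const std::function<void()>& stage)
{
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();
    stage();
    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    std::cout << name << ": cpu "
              << 1000.0 * (c1 - c0) / CLOCKS_PER_SEC << " ms, wall "
              << std::chrono::duration_cast<std::chrono::milliseconds>(w1 - w0).count()
              << " ms" << std::endl;
}

For example, profileStage("nms", [&]{ nms(res, outputData, kConfThresh, kNmsThresh); }); versus the same wrapper around inference().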

Thanks.

The two are nested (the outer one calls the inner one); I use the 'std::vector<DetectResult> YoloDetecter::inference(cv::Mat& img)' function for benchmarking.

nms(…) doesn't take much CPU.

Hi,

Can your model be run with the trtexec binary?
If yes, could you check whether inferring via trtexec also takes CPU resources?
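
For example (an illustrative command, assuming the engine file name mentioned above; CPU usage can be watched with top or tegrastats in another terminal):

trtexec --loadEngine=yolov8n-pose.engine --iterations=1000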

Thanks.

Is this still an issue that needs support? Is there any result that can be shared?

trtexec also takes 25% of the CPU.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks.

Hi,

Could you share the ONNX file that can be deployed with trtexec?
We want to reproduce this locally to check the usage further.
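
(For reference, an illustrative way we would deploy it, assuming the ONNX file name; the exact flags depend on your export:)

trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose.engine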

Thanks.