CPU seems to be slowed down by large TensorRT Engines - Cache problem?

Description

Hi there!

Recently I have been working on deploying YOLOv5 in TensorRT in C++. I am using a Jetson AGX Xavier for deployment.

My code is generally working great; however, I noticed some very interesting behaviour: when I load a larger (and slower) FP32 engine instead of the FP16 version of the same model, the pre- and postprocessing steps performed on the CPU are also slower. My assumption is that it may be related to the CPU's cache sizes, but I may very well be wrong.

The buffers are float arrays allocated using cudaMallocHost. The buffer sizes are identical across the different engines.
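Roughly, the allocation looks like this (a simplified sketch, not my exact code; the element counts correspond to the input and output binding shapes listed further down):

#include <cuda_runtime.h>

// Simplified sketch: host-pinned float buffers for the input and output
// bindings; the sizes are identical for the FP16 and FP32 engines.
const size_t inputElements = 1 * 3 * 640 * 640; // "images" binding
const size_t outputElements = 1 * 25200 * 85;   // "output" binding

float *inputBuffer = nullptr;
float *outputBuffer = nullptr;
cudaMallocHost(reinterpret_cast<void **>(&inputBuffer), inputElements * sizeof(float));
cudaMallocHost(reinterpret_cast<void **>(&outputBuffer), outputElements * sizeof(float));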

Very roughly, my code works like this for every loop:

Preprocessing: Copy every element of a cv::Mat into the float input Buffer
Inference: Call executionContext->executeV2(buffers)
Postprocessing: Read and evaluate the output buffer on the CPU
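In skeleton form (simplified; preprocess, postprocess, and capturing are just placeholder names, the real code is further down in this thread):

// Simplified per-frame loop; preprocess()/postprocess() are placeholders
// for the actual CPU code shown later in this thread.
while (capturing) {
    preprocess(img, static_cast<float *>(buffers[inputIndex]));  // CPU
    executionContext->executeV2(buffers);                        // GPU inference (blocking)
    postprocess(static_cast<float *>(buffers[outputIndex]));     // CPU
}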

Naturally, executeV2 is much slower with the FP32 engine than with FP16. However, I did not expect the pre- and postprocessing steps, which run entirely on the CPU, to be slower too. Could it have to do with the engine somehow landing in the CPU cache, or is there something else I'm missing?

Thank you in advance!

Environment

TensorRT Version: 8.0.0.1
GPU Type: AGX Xavier Volta GPU
JetPack Version: 4.6

Hi,

We are moving this issue to the Jetson AGX forum to get better help. Meanwhile, we recommend you share a repro of the issue for better debugging.

Thank you.

Hi,

Since the pre- and postprocessing are identical, there should not be much of a performance difference.
Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you share the CPU utilization for both FP32 and FP16 inference with us?

$ sudo tegrastats

Thanks.

Hi,

Thanks for the reply. I was already using nvpmodel 0, but not jetson_clocks. Using the latter actually improved performance, but did not change the odd pre- and postprocessing times.

The implementations are 100% identical and there is no difference in the code no matter which engine I load.

These are the CPU stats and timings (averaged over 1000 measurements) I got:

FP16:

AVG Latency: 33.89ms [PRE: 1.02 CUDA: 18.17 PST: 2.85]

RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [7%@2265,4%@2265,7%@2265,7%@2265,6%@2265,2%@2265,34%@2265,3%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 82%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@40.5C Tdiode@38.25C PMIC@50C AUX@35.5C CPU@37C thermal@37.6C Tboard@35C GPU 16107/16348 CPU 1688/1766 SOC 4143/4146 CV 0/0 VDDRQ 1994/2016 SYS5V 3920/3919
RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [5%@2265,4%@2265,5%@2265,5%@2265,6%@2265,6%@2265,33%@2265,10%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 73%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@41C Tdiode@38.5C PMIC@50C AUX@35.5C CPU@37C thermal@37.45C Tboard@35C GPU 16261/16346 CPU 1688/1764 SOC 4143/4146 CV 0/0 VDDRQ 1994/2016 SYS5V 3920/3919
RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,7%@2265,3%@2265,2%@2265,4%@2265,3%@2265,33%@2265,5%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 74%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@41C Tdiode@38.75C PMIC@50C AUX@35.5C CPU@37C thermal@37.6C Tboard@35C GPU 16261/16344 CPU 1841/1766 SOC 4143/4146 CV 0/0 VDDRQ 1994/2015 SYS5V 3920/3919

Engine Bindings:

[1 3 640 640] ("images") Datatype: 0
[1 3 80 80 85] ("528") Datatype: 1
[1 3 40 40 85] ("596") Datatype: 1
[1 3 20 20 85] ("664") Datatype: 1
[1 25200 85] ("output") Datatype: 0

FP32:

AVG Latency: 59.73ms [PRE: 1.85 CUDA: 45.14 PST: 0.68]

RAM 2761/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,6%@2265,9%@2265,8%@2265,8%@2265,6%@2265,8%@2265,6%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 3% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38C thermal@39.25C Tboard@36C GPU 20658/20480 CPU 1224/1261 SOC 4745/4614 CV 0/0 VDDRQ 2600/2600 SYS5V 4160/4130
RAM 2762/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,6%@2265,5%@2265,6%@2265,4%@2265,6%@2265,10%@2265,9%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 3% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38C thermal@39.4C Tboard@36C GPU 20505/20480 CPU 1378/1263 SOC 4592/4613 CV 0/0 VDDRQ 2601/2600 SYS5V 4120/4130
RAM 2761/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [5%@2265,5%@2265,4%@2265,4%@2265,5%@2265,9%@2265,12%@2265,4%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 2% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38.5C thermal@39.4C Tboard@36C GPU 20505/20481 CPU 1225/1262 SOC 4745/4616 CV 0/0 VDDRQ 2601/2600 SYS5V 4120/4130

Engine Bindings:

[1 3 640 640] ("images") Datatype: 0
[1 3 80 80 85] ("528") Datatype: 0
[1 3 40 40 85] ("594") Datatype: 0
[1 3 20 20 85] ("660") Datatype: 0
[1 25200 85] ("output") Datatype: 0

I find it very odd that for preprocessing, FP16 is faster, while for postprocessing, FP32 is faster.

The engine bindings are identical except for the datatype of the intermediate bindings (which I do not access).
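For reference, the Datatype values are the raw nvinfer1::DataType enum (0 = kFLOAT, 1 = kHALF). The dumps were produced with something like this (a rough sketch, not my exact code):

#include <NvInfer.h>
#include <iostream>

// Rough sketch of how the binding dumps above were generated,
// given a deserialized nvinfer1::ICudaEngine *engine:
for (int i = 0; i < engine->getNbBindings(); ++i) {
    const nvinfer1::Dims dims = engine->getBindingDimensions(i);
    std::cout << "[";
    for (int d = 0; d < dims.nbDims; ++d)
        std::cout << dims.d[d] << (d + 1 < dims.nbDims ? " " : "");
    std::cout << "] (\"" << engine->getBindingName(i) << "\") Datatype: "
              << static_cast<int>(engine->getBindingDataType(i)) << std::endl;
}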

Sadly I cannot share the full code, but this is the preprocessing I do (img being a cv::Mat):

float *pFloat = static_cast<float *>(buffers[inputIndex]);
// forEach is significantly faster than all other methods to traverse over the cv::Mat
img.forEach<cv::Vec3b>([&](cv::Vec3b &p, const int *position) -> void {
    // p[0-2] contains bgr data, position[0-1] the row-column location
    // Incoming data is BGR, so convert to RGB in the process
    int index = model_height * position[0] + position[1];
    pFloat[index] = p[2] / input_float_divisor;
    pFloat[model_size + index] = p[1] / input_float_divisor;
    pFloat[2 * model_size + index] = p[0] / input_float_divisor;
});

Engine execution:

// Invoke asynchronous inference
context->enqueueV2(buffers, 0, nullptr);

float *model_output = static_cast<float *>(buffers[outputIndex]);

// wait for inference to finish
cudaStreamSynchronize(0);
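(A side note: earlier I mentioned executeV2; the snippet above uses the asynchronous enqueueV2 on the default stream followed by a synchronize. The synchronous executeV2 call combines the enqueue and the wait into a single blocking step:)

// Synchronous alternative to the enqueueV2 + cudaStreamSynchronize pair above
context->executeV2(buffers);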

Postprocessing:

unsigned long dimensions = 5 + num_classes; // 0,1,2,3 -> box, 4 -> confidence, 5-85 -> coco class confidences
const unsigned long confidenceIndex = 4;
const unsigned long labelStartIndex = 5;

int highest_conf_index = 0;
int highest_conf_label = 0;
float highest_conf = 0.4f; // confidence threshold is 40%
for (int index = 0; index < output_size; index += dimensions) {
    float confidence = model_output[index + confidenceIndex];

    // for multiple classes, combine the confidence with class confidences
    // for single class models, this step can be skipped
    if (num_classes > 1) {
        if (confidence <= highest_conf) {
            continue;
        }
        for (unsigned long j = labelStartIndex; j < dimensions; ++j) {
            float combined_conf = model_output[index + j] * confidence;
            if (combined_conf > highest_conf) {
                highest_conf = combined_conf;
                highest_conf_index = index;
                highest_conf_label = j - labelStartIndex;
            }
        }
    } else {
        if (confidence > highest_conf) {
            highest_conf = confidence;
            highest_conf_index = index;
            highest_conf_label = 1;
        }
    }
}

// Evaluate results
if (highest_conf > 0.4f) {
    if (TRACE_LOGGING)
        ROS_INFO("Detected class %d with confidence %lf", highest_conf_label, highest_conf);
    detection->propability = highest_conf;
    detection->classID = highest_conf_label;
    detection->centerX = model_output[highest_conf_index];
    detection->centerY = model_output[highest_conf_index + 1];
    detection->width = model_output[highest_conf_index + 2];
    detection->height = model_output[highest_conf_index + 3];
} else {
    detection->propability = 0.0;
}
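For completeness, the PRE / CUDA / PST numbers above are measured roughly like this (a sketch; preprocess/infer/postprocess stand in for the three snippets above, and the accumulators are placeholders):

#include <chrono>

using Clock = std::chrono::steady_clock;
auto ms = [](Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
};

auto t0 = Clock::now();
preprocess();   // cv::Mat -> input buffer
auto t1 = Clock::now();
infer();        // enqueueV2 + cudaStreamSynchronize
auto t2 = Clock::now();
postprocess();  // scan the output buffer
auto t3 = Clock::now();

pre_ms += ms(t0, t1);   // averaged over 1000 iterations afterwards
cuda_ms += ms(t1, t2);
pst_ms += ms(t2, t3);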

Given that the input and output sizes are identical across both engines, I really do not see where the difference is coming from…

Hi,
Please share the model, script, profiler, and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
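For example, to compare the pure network latency of the FP32 and FP16 variants (the file name is a placeholder for your ONNX model):

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --fp16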

While measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy

Thanks!
