C++ - Stuck with YoloV4, ONNX and TensorRT

I’m doing detection with YoloV4/C++/OpenCV and it runs pretty well. However, to reduce inference time I’m trying to move everything to NVIDIA TensorRT, and I’m feeling lost there.

I converted the .weights file to ONNX using the TensorRT tools, then converted the ONNX model to a TensorRT engine like this:

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <vector>

// MyLogger (an ILogger implementation) and onnxModelFile are defined elsewhere
void ONNXConvert()
{
    MyLogger logger;
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(
        1U << static_cast<int>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

    // Create the ONNX parser attached to the network definition
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);

    // Read the ONNX model from disk
    std::ifstream onnxFile(onnxModelFile, std::ios::binary);
    if (!onnxFile)
    {
        std::cerr << "Error opening ONNX model file: " << onnxModelFile << std::endl;
        return;
    }
    onnxFile.seekg(0, onnxFile.end);
    const size_t modelSize = onnxFile.tellg();
    onnxFile.seekg(0, onnxFile.beg);

    // Allocate a buffer to hold the ONNX model, then parse it
    std::vector<char> onnxModelBuffer(modelSize);
    onnxFile.read(onnxModelBuffer.data(), modelSize);

    if (!parser->parse(onnxModelBuffer.data(), modelSize))
    {
        std::cerr << "Error parsing ONNX model." << std::endl;
        return;
    }

    // Create a builder configuration
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

    // Set configuration options as needed (1 GiB workspace here)
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1 << 30);

    nvinfer1::IHostMemory* serializedEngine = builder->buildSerializedNetwork(*network, *config);
    if (!serializedEngine)
    {
        std::cerr << "Engine build failed." << std::endl;
        return;
    }
    std::cout << "Number of layers in the network: " << network->getNbLayers() << std::endl;

    std::ofstream outFile("yolov4.engine", std::ios::binary);
    outFile.write(reinterpret_cast<const char*>(serializedEngine->data()), serializedEngine->size());
    outFile.close();

    // destroy() is deprecated since TensorRT 8; plain delete releases these objects
    delete serializedEngine;
    delete config;
    delete parser;
    delete network;
    delete builder;
}

With that done, I can load the generated engine and run the inference. Everything seems to go well until I try to parse the detection results.
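Stripped down, my loading and inference side looks roughly like this (a simplified sketch: the shapes are hard-coded from my config, and I assume the binding order is input, boxes, class scores, although real code should look bindings up by name):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <vector>

void RunInference(const std::vector<float>& inputCHW) // 1x3x608x608, preprocessed
{
    MyLogger logger;

    // Read the serialized engine back from disk
    std::ifstream engineFile("yolov4.engine", std::ios::binary);
    std::vector<char> engineData((std::istreambuf_iterator<char>(engineFile)),
                                 std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineData.data(), engineData.size());
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // Device buffers; sizes assume input 1x3x608x608, boxes 1x22743x1x4,
    // class scores 1x22743x20, and binding indices 0/1/2 in that order
    void* buffers[3];
    cudaMalloc(&buffers[0], 3 * 608 * 608 * sizeof(float));
    cudaMalloc(&buffers[1], 22743 * 4 * sizeof(float));
    cudaMalloc(&buffers[2], 22743 * 20 * sizeof(float));

    cudaMemcpy(buffers[0], inputCHW.data(), inputCHW.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    context->executeV2(buffers); // synchronous execution

    std::vector<float> boxes(22743 * 4), classes(22743 * 20);
    cudaMemcpy(boxes.data(), buffers[1], boxes.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(classes.data(), buffers[2], classes.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // ... parse boxes/classes here (see below) ...

    for (void* b : buffers) cudaFree(b);
    delete context;
    delete engine;
    delete runtime;
}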

I want to get the class probabilities and the bounding box coordinates, but all I get are inconsistent values.

From my YoloV4 config, I know I have:

20 classes
Input width = 608
Input height = 608
Channels = 3
9 anchors (width/height pairs): { (12,16), (19,36), (40,28), (36,75), (76,55), (72,146), (142,110), (192,243), (459,401) }

After the inference, I have 2 output buffers:

a 1x22743x1x4 buffer, where I guess I will find the bounding box coordinates
a 1x22743x20 buffer, where I guess I will find the class probabilities

And this is where I’m getting lost. How should I parse the detections to correctly compute the coordinates and class probabilities?

As I have 22743 detections, I guess this comes from the three YOLO heads on top of the CSPDarknet53 backbone: 3 grids (19x19, 38x38 and 76x76) with 3 anchors each, i.e. (361 + 1444 + 5776) × 3 = 22743.
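From what I’ve read about the Darknet decode, each raw head output (1x75xHxW, i.e. 3 anchors × (5 + 20 classes) channels) is supposed to be transformed roughly as sketched below. This is only my understanding: the layout and helper names are my own, and I’ve left out the cfg’s scale_x_y factor for clarity.

#include <cmath>
#include <vector>

struct Detection { float x, y, w, h, score; int classId; };

inline float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Decode ONE raw head output shaped [1, 3*(5+C), H, W] in CHW layout,
// where each anchor owns (5+C) channels: tx, ty, tw, th, tobj, classes
void DecodeHead(const float* out, int gridW, int gridH,
                const float* anchors, // 3 (w,h) pairs, in input pixels
                int numClasses, int inputW, int inputH,
                float confThreshold, std::vector<Detection>& dets)
{
    const int stride = gridW * gridH;          // elements per channel
    const int chansPerAnchor = 5 + numClasses;

    for (int a = 0; a < 3; a++)
    {
        const float* base = out + a * chansPerAnchor * stride;
        for (int cy = 0; cy < gridH; cy++)
        for (int cx = 0; cx < gridW; cx++)
        {
            const int cell = cy * gridW + cx;
            const float objectness = sigmoid(base[4 * stride + cell]);
            if (objectness < confThreshold) continue;

            Detection d;
            // Center: sigmoid offset within the cell, normalized to [0,1]
            d.x = (cx + sigmoid(base[0 * stride + cell])) / gridW;
            d.y = (cy + sigmoid(base[1 * stride + cell])) / gridH;
            // Size: exp(t) scaled by the anchor, normalized by the input size
            d.w = anchors[a * 2 + 0] * std::exp(base[2 * stride + cell]) / inputW;
            d.h = anchors[a * 2 + 1] * std::exp(base[3 * stride + cell]) / inputH;

            // Final score = objectness * per-class probability
            d.score = 0.0f; d.classId = -1;
            for (int c = 0; c < numClasses; c++)
            {
                float p = objectness * sigmoid(base[(5 + c) * stride + cell]);
                if (p > d.score) { d.score = p; d.classId = c; }
            }
            if (d.score >= confThreshold) dets.push_back(d);
        }
    }
}

If the usual masks apply, each head would take three consecutive anchor pairs: (12,16), (19,36), (40,28) for the 76x76 grid, up to (142,110), (192,243), (459,401) for the 19x19 one, and the surviving detections would then go through NMS (e.g. cv::dnn::NMSBoxes). But if the flattened 1x22743x… outputs come from decode nodes added during the conversion, the boxes may already be normalized coordinates and the class buffer already final probabilities; I’m not sure which case applies to my export.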

I naively tried to parse the outputs directly, like this:

for (int d = 0; d < 22743; d++)
{
    // Find the best-scoring class for this candidate detection
    float maxProb = -1000.0f;
    int classId = -1;
    for (int c = 0; c < 20; c++)
    {
        if (classes[d * 20 + c] > maxProb)
        {
            maxProb = classes[d * 20 + c];
            classId = c;
        }
    }

    // Keep the detection only if the best class clears the threshold
    if (maxProb > CONFIDENCE_THRESHOLD)
    {
        float boxX = boxes[d * 4];
        float boxY = boxes[d * 4 + 1];
        float boxW = boxes[d * 4 + 2];
        float boxH = boxes[d * 4 + 3];
    }
}

But all I get are tiny probabilities (below 1e-05) and tiny, sometimes negative, box coordinates.
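For completeness, here is roughly the preprocessing I believe the model expects, following the usual Darknet convention (RGB, values scaled to [0,1], CHW layout). This is a sketch with OpenCV, using a plain resize, whereas original Darknet letterboxes to preserve aspect ratio:

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<float> Preprocess(const cv::Mat& bgr, int inputW = 608, int inputH = 608)
{
    cv::Mat rgb, resized;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);     // Darknet expects RGB
    cv::resize(rgb, resized, cv::Size(inputW, inputH));

    // HWC uint8 -> CHW float in [0,1]
    std::vector<float> blob(3 * inputH * inputW);
    for (int c = 0; c < 3; c++)
        for (int y = 0; y < inputH; y++)
            for (int x = 0; x < inputW; x++)
                blob[c * inputH * inputW + y * inputW + x] =
                    resized.at<cv::Vec3b>(y, x)[c] / 255.0f;
    return blob;
}

As far as I understand, cv::dnn::blobFromImage(img, 1 / 255.0, cv::Size(608, 608), cv::Scalar(), true, false) produces the same layout in one call.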

Could someone give me a hand with this? Any help would be really appreciated.

Hi,

Could you please share the model, script, profiler, and performance output, if not shared already, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
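For example (exact flags can vary between TensorRT versions):

trtexec --onnx=model.onnx --saveEngine=model.engine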

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre and post-processing overhead.
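For instance, network-only latency can be bracketed with CUDA events (a minimal sketch; the context and buffers are assumed to be set up as in your code):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Time only the network execution with CUDA events, so that CPU-side
// pre- and post-processing is excluded from the measurement
float TimeNetworkOnlyMs(nvinfer1::IExecutionContext* context, void* const* buffers)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    context->executeV2(buffers);   // network inference only
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}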

Thanks!

Hi, and thank you for your answer.

As I started with the base yolov4 model and simply converted it to ONNX, I’ll need a little time to understand clearly what I have to do next and how to do it, especially how to use trtexec.

I don’t know if it’s what was asked for here, but here is my initial Yolo V4 model:

And here’s the ONNX model after conversion:

Finally, the trtexec result:

&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # /home/stephane/Downloads/TensorRT/build/out/trtexec --onnx=model.onnx --exportTimes=trace.json
[11/30/2023-16:13:27] [I] === Model Options ===
[11/30/2023-16:13:27] [I] Format: ONNX
[11/30/2023-16:13:27] [I] Model: model.onnx
[11/30/2023-16:13:27] [I] Output:
[11/30/2023-16:13:27] [I] === Build Options ===
[11/30/2023-16:13:27] [I] Max batch: explicit batch
[11/30/2023-16:13:27] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/30/2023-16:13:27] [I] minTiming: 1
[11/30/2023-16:13:27] [I] avgTiming: 8
[11/30/2023-16:13:27] [I] Precision: FP32
[11/30/2023-16:13:27] [I] LayerPrecisions: 
[11/30/2023-16:13:27] [I] Layer Device Types: 
[11/30/2023-16:13:27] [I] Calibration: 
[11/30/2023-16:13:27] [I] Refit: Disabled
[11/30/2023-16:13:27] [I] Version Compatible: Disabled
[11/30/2023-16:13:27] [I] ONNX Native InstanceNorm: Disabled
[11/30/2023-16:13:27] [I] TensorRT runtime: full
[11/30/2023-16:13:27] [I] Lean DLL Path: 
[11/30/2023-16:13:27] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[11/30/2023-16:13:27] [I] Exclude Lean Runtime: Disabled
[11/30/2023-16:13:27] [I] Sparsity: Disabled
[11/30/2023-16:13:27] [I] Safe mode: Disabled
[11/30/2023-16:13:27] [I] Build DLA standalone loadable: Disabled
[11/30/2023-16:13:27] [I] Allow GPU fallback for DLA: Disabled
[11/30/2023-16:13:27] [I] DirectIO mode: Disabled
[11/30/2023-16:13:27] [I] Restricted mode: Disabled
[11/30/2023-16:13:27] [I] Skip inference: Disabled
[11/30/2023-16:13:27] [I] Save engine: 
[11/30/2023-16:13:27] [I] Load engine: 
[11/30/2023-16:13:27] [I] Profiling verbosity: 0
[11/30/2023-16:13:27] [I] Tactic sources: Using default tactic sources
[11/30/2023-16:13:27] [I] timingCacheMode: local
[11/30/2023-16:13:27] [I] timingCacheFile: 
[11/30/2023-16:13:27] [I] Heuristic: Disabled
[11/30/2023-16:13:27] [I] Preview Features: Use default preview flags.
[11/30/2023-16:13:27] [I] MaxAuxStreams: -1
[11/30/2023-16:13:27] [I] BuilderOptimizationLevel: -1
[11/30/2023-16:13:27] [I] Input(s)s format: fp32:CHW
[11/30/2023-16:13:27] [I] Output(s)s format: fp32:CHW
[11/30/2023-16:13:27] [I] Input build shapes: model
[11/30/2023-16:13:27] [I] Input calibration shapes: model
[11/30/2023-16:13:27] [I] === System Options ===
[11/30/2023-16:13:27] [I] Device: 0
[11/30/2023-16:13:27] [I] DLACore: 
[11/30/2023-16:13:27] [I] Plugins:
[11/30/2023-16:13:27] [I] setPluginsToSerialize:
[11/30/2023-16:13:27] [I] dynamicPlugins:
[11/30/2023-16:13:27] [I] ignoreParsedPluginLibs: 0
[11/30/2023-16:13:27] [I] 
[11/30/2023-16:13:27] [I] === Inference Options ===
[11/30/2023-16:13:27] [I] Batch: Explicit
[11/30/2023-16:13:27] [I] Input inference shapes: model
[11/30/2023-16:13:27] [I] Iterations: 10
[11/30/2023-16:13:27] [I] Duration: 3s (+ 200ms warm up)
[11/30/2023-16:13:27] [I] Sleep time: 0ms
[11/30/2023-16:13:27] [I] Idle time: 0ms
[11/30/2023-16:13:27] [I] Inference Streams: 1
[11/30/2023-16:13:27] [I] ExposeDMA: Disabled
[11/30/2023-16:13:27] [I] Data transfers: Enabled
[11/30/2023-16:13:27] [I] Spin-wait: Disabled
[11/30/2023-16:13:27] [I] Multithreading: Disabled
[11/30/2023-16:13:27] [I] CUDA Graph: Disabled
[11/30/2023-16:13:27] [I] Separate profiling: Disabled
[11/30/2023-16:13:27] [I] Time Deserialize: Disabled
[11/30/2023-16:13:27] [I] Time Refit: Disabled
[11/30/2023-16:13:27] [I] NVTX verbosity: 0
[11/30/2023-16:13:27] [I] Persistent Cache Ratio: 0
[11/30/2023-16:13:27] [I] Inputs:
[11/30/2023-16:13:27] [I] === Reporting Options ===
[11/30/2023-16:13:27] [I] Verbose: Disabled
[11/30/2023-16:13:27] [I] Averages: 10 inferences
[11/30/2023-16:13:27] [I] Percentiles: 90,95,99
[11/30/2023-16:13:27] [I] Dump refittable layers:Disabled
[11/30/2023-16:13:27] [I] Dump output: Disabled
[11/30/2023-16:13:27] [I] Profile: Disabled
[11/30/2023-16:13:27] [I] Export timing to JSON file: trace.json
[11/30/2023-16:13:27] [I] Export output to JSON file: 
[11/30/2023-16:13:27] [I] Export profile to JSON file: 
[11/30/2023-16:13:27] [I] 
[11/30/2023-16:13:27] [I] === Device Information ===
[11/30/2023-16:13:27] [I] Selected Device: NVIDIA GeForce RTX 3090
[11/30/2023-16:13:27] [I] Compute Capability: 8.6
[11/30/2023-16:13:27] [I] SMs: 82
[11/30/2023-16:13:27] [I] Device Global Memory: 24575 MiB
[11/30/2023-16:13:27] [I] Shared Memory per SM: 100 KiB
[11/30/2023-16:13:27] [I] Memory Bus Width: 384 bits (ECC disabled)
[11/30/2023-16:13:27] [I] Application Compute Clock Rate: 1.695 GHz
[11/30/2023-16:13:27] [I] Application Memory Clock Rate: 9.751 GHz
[11/30/2023-16:13:27] [I] 
[11/30/2023-16:13:27] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[11/30/2023-16:13:27] [I] 
[11/30/2023-16:13:27] [I] TensorRT version: 8.6.1
[11/30/2023-16:13:27] [I] Loading standard plugins
[11/30/2023-16:13:28] [I] [TRT] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 21, GPU 1255 (MiB)
[11/30/2023-16:13:41] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1451, GPU +266, now: CPU 1548, GPU 1521 (MiB)
[11/30/2023-16:13:41] [I] Start parsing network model.
[11/30/2023-16:13:41] [I] [TRT] ----------------------------------------------------------------
[11/30/2023-16:13:41] [I] [TRT] Input filename:   model.onnx
[11/30/2023-16:13:41] [I] [TRT] ONNX IR version:  0.0.9
[11/30/2023-16:13:41] [I] [TRT] Opset version:    19
[11/30/2023-16:13:41] [I] [TRT] Producer name:    NVIDIA TensorRT sample
[11/30/2023-16:13:41] [I] [TRT] Producer version: 
[11/30/2023-16:13:41] [I] [TRT] Domain:           
[11/30/2023-16:13:41] [I] [TRT] Model version:    0
[11/30/2023-16:13:41] [I] [TRT] Doc string:       
[11/30/2023-16:13:41] [I] [TRT] ----------------------------------------------------------------
[11/30/2023-16:13:41] [I] Finished parsing network model. Parse time: 0.458234
[11/30/2023-16:13:41] [I] [TRT] Graph optimization time: 0.241336 seconds.
[11/30/2023-16:13:41] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/30/2023-16:16:08] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[11/30/2023-16:16:09] [I] [TRT] Total Host Persistent Memory: 605104
[11/30/2023-16:16:09] [I] [TRT] Total Device Persistent Memory: 72192
[11/30/2023-16:16:09] [I] [TRT] Total Scratch Memory: 3146752
[11/30/2023-16:16:09] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 268 MiB, GPU 379 MiB
[11/30/2023-16:16:09] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 297 steps to complete.
[11/30/2023-16:16:09] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 13.3046ms to assign 7 blocks to 297 nodes requiring 145647616 bytes.
[11/30/2023-16:16:09] [I] [TRT] Total Activation Memory: 145647616
[11/30/2023-16:16:09] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +243, GPU +310, now: CPU 243, GPU 310 (MiB)
[11/30/2023-16:16:09] [I] Engine built in 162.003 sec.
[11/30/2023-16:16:10] [I] [TRT] Loaded engine size: 313 MiB
[11/30/2023-16:16:10] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +309, now: CPU 0, GPU 309 (MiB)
[11/30/2023-16:16:10] [I] Engine deserialized in 0.118402 sec.
[11/30/2023-16:16:10] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +139, now: CPU 0, GPU 448 (MiB)
[11/30/2023-16:16:10] [I] Setting persistentCacheLimit to 0 bytes.
[11/30/2023-16:16:10] [I] Using random values for input 000_net
[11/30/2023-16:16:10] [I] Input binding for 000_net with dimensions 1x3x608x608 is created.
[11/30/2023-16:16:10] [I] Output binding for 139_convolutional with dimensions 1x75x76x76 is created.
[11/30/2023-16:16:10] [I] Output binding for 150_convolutional with dimensions 1x75x38x38 is created.
[11/30/2023-16:16:10] [I] Output binding for 161_convolutional with dimensions 1x75x19x19 is created.
[11/30/2023-16:16:10] [I] Starting inference
[11/30/2023-16:16:13] [I] Warmup completed 11 queries over 200 ms
[11/30/2023-16:16:13] [I] Timing trace has 315 queries over 3.03287 s
[11/30/2023-16:16:13] [I] 
[11/30/2023-16:16:13] [I] === Trace details ===
[11/30/2023-16:16:13] [I] Trace averages of 10 runs:
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 10.7557 ms - Host latency: 11.2913 ms (enqueue 1.62425 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.43422 ms - Host latency: 9.95281 ms (enqueue 1.77296 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.47271 ms - Host latency: 9.9888 ms (enqueue 1.5558 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.30315 ms - Host latency: 9.82296 ms (enqueue 2.65082 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.7327 ms - Host latency: 10.2487 ms (enqueue 2.10241 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.31012 ms - Host latency: 9.83085 ms (enqueue 1.76286 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.48531 ms - Host latency: 10.0054 ms (enqueue 1.9568 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.45345 ms - Host latency: 9.97006 ms (enqueue 1.6504 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.37216 ms - Host latency: 9.89117 ms (enqueue 2.2535 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.80562 ms - Host latency: 10.3271 ms (enqueue 2.0222 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.61554 ms - Host latency: 10.1314 ms (enqueue 1.86194 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.54545 ms - Host latency: 10.0641 ms (enqueue 2.04083 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.44806 ms - Host latency: 9.96426 ms (enqueue 1.94412 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.41013 ms - Host latency: 9.92666 ms (enqueue 1.68965 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.59486 ms - Host latency: 10.1178 ms (enqueue 2.00084 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.55466 ms - Host latency: 10.0734 ms (enqueue 1.77604 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.54042 ms - Host latency: 10.0582 ms (enqueue 1.92233 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.515 ms - Host latency: 10.0319 ms (enqueue 1.90164 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.34965 ms - Host latency: 9.86656 ms (enqueue 2.14177 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.56198 ms - Host latency: 10.0819 ms (enqueue 2.04242 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.37832 ms - Host latency: 9.89636 ms (enqueue 2.50942 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.4113 ms - Host latency: 9.93528 ms (enqueue 2.11689 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.42666 ms - Host latency: 9.94309 ms (enqueue 1.47251 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.31069 ms - Host latency: 9.82793 ms (enqueue 2.70776 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.29368 ms - Host latency: 9.81333 ms (enqueue 1.59873 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.6446 ms - Host latency: 10.1716 ms (enqueue 2.44685 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.42156 ms - Host latency: 9.93904 ms (enqueue 1.6731 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.42437 ms - Host latency: 9.94026 ms (enqueue 1.50186 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.38308 ms - Host latency: 9.89871 ms (enqueue 1.65535 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.28198 ms - Host latency: 9.80249 ms (enqueue 2.22273 ms)
[11/30/2023-16:16:13] [I] Average on 10 runs - GPU latency: 9.62429 ms - Host latency: 10.1458 ms (enqueue 2.00603 ms)
[11/30/2023-16:16:13] [I] 
[11/30/2023-16:16:13] [I] === Performance summary ===
[11/30/2023-16:16:13] [I] Throughput: 103.862 qps
[11/30/2023-16:16:13] [I] Latency: min = 9.26605 ms, max = 22.6106 ms, mean = 10.0261 ms, median = 9.67999 ms, percentile(90%) = 11.1342 ms, percentile(95%) = 11.479 ms, percentile(99%) = 11.9709 ms
[11/30/2023-16:16:13] [I] Enqueue Time: min = 1.09375 ms, max = 5.44299 ms, mean = 1.95569 ms, median = 1.77539 ms, percentile(90%) = 2.9458 ms, percentile(95%) = 3.54321 ms, percentile(99%) = 4.56079 ms
[11/30/2023-16:16:13] [I] H2D Latency: min = 0.333252 ms, max = 0.353394 ms, mean = 0.335691 ms, median = 0.335205 ms, percentile(90%) = 0.337646 ms, percentile(95%) = 0.339355 ms, percentile(99%) = 0.341553 ms
[11/30/2023-16:16:13] [I] GPU Compute Time: min = 8.75006 ms, max = 21.9054 ms, mean = 9.50684 ms, median = 9.16171 ms, percentile(90%) = 10.6179 ms, percentile(95%) = 10.9548 ms, percentile(99%) = 11.4514 ms
[11/30/2023-16:16:13] [I] D2H Latency: min = 0.17749 ms, max = 0.368896 ms, mean = 0.183529 ms, median = 0.180359 ms, percentile(90%) = 0.184082 ms, percentile(95%) = 0.20752 ms, percentile(99%) = 0.222046 ms
[11/30/2023-16:16:13] [I] Total Host Walltime: 3.03287 s
[11/30/2023-16:16:13] [I] Total GPU Compute Time: 2.99466 s
[11/30/2023-16:16:13] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/30/2023-16:16:13] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # /home/stephane/Downloads/TensorRT/build/out/trtexec --onnx=model.onnx --exportTimes=trace.json

Hi @stephane.sochacki,
I see the trtexec run passed.
Are you able to run inference?

Hi, and sorry for the late answer, I guess I missed the notification…
I actually managed to run the inference. What I’m lost on is what to do with the data, or more specifically, how to “decode” it.