YOLOv8 nvinferserver FP16 not working

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): RTX 2080 Ti
• DeepStream Version: 6.2-triton
• NVIDIA GPU Driver Version (valid for GPU only): 525
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

I solved the yolov8 nvinferserver problem through

However, when building the TensorRT engine, I set the input and output tensors to fp16, created the engine, and set up a Triton server, but it did not work.

It works if I leave out the --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw part, but then I have to set the data_type in config.pbtxt to TYPE_FP32 for it to work.

The same behavior occurs with nvinfer.

The model used was the official yolov8 model.

Below is the trtexec conversion command.

/usr/src/tensorrt/bin/trtexec --verbose \
    --onnx=yolov8l.onnx \
    --fp16 \
    --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
    --minShapes=images:1x3x640x640 \
    --optShapes=images:12x3x640x640 \
    --maxShapes=images:16x3x640x640 \
    --device=0

Below is the Triton server config.pbtxt.

name: "yolov8_fp16"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ -1, 3, 640, 640 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP16
    dims: [ -1, 8400, 4 ]
  },
  {
    name: "scores"
    data_type: TYPE_FP16
    dims: [ -1, 8400, 1 ]
  },
  {
    name: "classes"
    data_type: TYPE_FP16
    dims: [ -1, 8400, 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 30000
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}

Below is the nvinferserver config.txt.

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 0
  backend {
    trt_is {
      model_name: "yolov8_fp16"
      version: 1
      model_repo {
        root: "../../../../samples/triton_yolo"
        log_level: 2
        tf_gpu_memory_fraction: 0.4
        tf_disable_soft_placement: 0
      }
    }
  }

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    tensor_name: "input"
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    symmetric_padding: 1
    maintain_aspect_ratio: 1
    normalize {
      scale_factor: 0.0039215697906911373
      channel_offsets: [0.0, 0.0, 0.0]
    }
  }

  postprocess {
    detection {
      custom_parse_bbox_func: "NvDsInferParseYolo"
      nms {
      }
    }
  }

  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream-6.2/sources/deepstream_python_apps-1.1.6/apps/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo_triton/libnvdsinfer_custom_impl_Yolo.so"
  }
}
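For reference, the normalize block in the preprocess group above maps 8-bit RGB pixels into [0, 1], since 0.0039215697906911373 ≈ 1/255. A minimal sketch of the arithmetic, assuming nvinferserver applies out = scale_factor * (in - channel_offset):

```python
import numpy as np

# scale_factor / channel_offsets copied from the normalize block above.
scale_factor = 0.0039215697906911373  # ~= 1/255
channel_offsets = np.array([0.0, 0.0, 0.0])

pixel = np.array([255.0, 128.0, 0.0])  # one RGB pixel in [0, 255]
out = (pixel - channel_offsets) * scale_factor
print(out.round(3).tolist())  # [1.0, 0.502, 0.0]
```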

What do you mean by “it did not work”? Did engine creation fail? Could you share the whole log?

Engine creation completed normally, and the Triton server also started normally. However, when I run the pipeline, no video appears on screen. If I build the engine without --inputIOFormats=fp16:chw and --outputIOFormats=fp16:chw, it works properly; with those arguments, it does not work properly.

I tested on my side. Here is the result.

  1. I used --inputIOFormats=fp16:chw and --outputIOFormats=fp16:chw to generate the engine. Using this new engine, the app crashed with this call stack.
    (gdb) bt
    #0 0x00007fffaffa17a9 in decodeTensorYolo(float const*, float const*, float const*, unsigned int const&, …

  2. Why do you need “--outputIOFormats=fp16:chw”?

  3. If using --outputIOFormats=fp16:chw, you need to modify NvDsInferParseCustomYolo, because the output data type has changed.
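The crash in the call stack above is consistent with this data-type mismatch: decodeTensorYolo reads the output buffers as float const*, but with --outputIOFormats=fp16:chw the engine writes 2-byte half-precision values. A minimal numpy sketch (illustrative values, not the real tensors) of the wrong vs. correct interpretation of the same bytes:

```python
import numpy as np

# Pretend this is an output buffer written by an engine built with
# --outputIOFormats=fp16:chw (values chosen to be exactly representable).
scores_fp16 = np.array([0.5, 0.25, 1.0, 2.0], dtype=np.float16)
raw_bytes = scores_fp16.tobytes()  # 4 values * 2 bytes = 8 bytes

# What an unmodified fp32 parser effectively does: reinterpret the same
# bytes as float32 -- half as many elements, and all of them garbage.
wrong = np.frombuffer(raw_bytes, dtype=np.float32)
print(wrong.size)  # 2

# What a fixed parser must do: read float16 first, then widen to float32.
right = np.frombuffer(raw_bytes, dtype=np.float16).astype(np.float32)
print(right.tolist())  # [0.5, 0.25, 1.0, 2.0]
```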

  1. This is because I thought that to get the performance benefit of fp16, the input and output tensors also had to be set to fp16.
  2. If the input and output tensors stay fp32, is there a significant difference in overall inference performance compared with fp16 I/O?
  3. Since I can't write C++, if I use fp16, should I implement the postprocess in Python through output_tensor_meta?
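On question 3: if you do keep fp16 outputs and go the Python route, the key step is converting each raw fp16 output buffer to float32 before decoding boxes in numpy. A hedged sketch of just that conversion step; fp16_tensor_to_fp32 and the stand-in buffer below are illustrative, not part of the pyds API:

```python
import ctypes

import numpy as np


def fp16_tensor_to_fp32(buf_ptr: int, num_elems: int) -> np.ndarray:
    """Copy num_elems half-precision values from a raw buffer pointer
    and widen them to float32 for normal postprocessing."""
    raw = ctypes.string_at(buf_ptr, num_elems * 2)  # 2 bytes per fp16 value
    return np.frombuffer(raw, dtype=np.float16).astype(np.float32)


# Stand-in for an engine output buffer; in a real pad probe this pointer
# would come from the tensor meta attached by the nvinferserver element.
boxes = np.arange(8, dtype=np.float16)
decoded = fp16_tensor_to_fp32(boxes.ctypes.data, boxes.size)
print(decoded.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```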

“--fp16” means “enable fp16 precision”; it improves inference performance compared with fp32. “--inputIOFormats/--outputIOFormats” are used to restrict the I/O data type and memory layout; you can find their usage in this doc.
You don't need to set inputIOFormats and outputIOFormats, because “--fp16” alone provides the acceleration.

Thanks for your answer, but I'm still confused. I'm just learning DeepStream, so I don't fully understand what you're talking about.

  1. To summarize: if I want to change the data_type of the input and output tensors in config.pbtxt to TYPE_FP16, there are two options: modify NvDsInferParseCustomYolo, or keep the data_type of the input and output tensors in config.pbtxt as TYPE_FP32. Is that right?

  2. I read the doc, but I still don't really understand what inputIOFormats and outputIOFormats do. Do they affect performance? I would appreciate a more detailed explanation of their function.

Modifying NvDsInferParseCustomYolo is not optional: after you change the model's I/O data_type, you must modify NvDsInferParseCustomYolo accordingly, since it is the postprocess function that parses the bounding boxes.

This issue is outside of DeepStream. You could try asking in the TensorRT forum.