Batch inference using inswapper model

Setup:
• Hardware Platform (Jetson / GPU): Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only): 6.1
• TensorRT Version: 10.3

As described in my earlier issue, "Double free or corruption (out) in Deepstream 7.1 in Jetson AGX Orin Jetpack 6.1 for custom secondary gie", I am running a DeepStream app that performs face swapping. The full pipeline image is attached to that issue. The existing DeepStream pipeline now works as expected.

Goal: I now want to implement batch processing with the inswapper model, which can take multiple face objects at once and return one output per face.

The model architecture is as follows:

    ---- 2 Engine Input(s) ----
    {target [dtype=float32, shape=(-1, 3, 128, 128)],
     source [dtype=float32, shape=(-1, 512)]}
    
    ---- 1 Engine Output(s) ----
    {output [dtype=float32, shape=(-1, 3, 128, 128)]}
    
    ---- Memory ----
    Device Memory: 302415872 bytes
    
    ---- 1 Profile(s) (3 Tensor(s) Each) ----
    - Profile: 0
        Tensor: target          (Input), Index: 0 | Shapes: min=(1, 3, 128, 128), opt=(16, 3, 128, 128), max=(16, 3, 128, 128)
        Tensor: source          (Input), Index: 1 | Shapes: min=(1, 512), opt=(16, 512), max=(16, 512)
        Tensor: output         (Output), Index: 2 | Shape: (-1, 3, 128, 128)
    
    ---- 138 Layer(s) ----
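
As a rough aid for reading the profile above, the buffer sizes implied by a batch of N faces can be worked out directly from the listed shapes. The following is a minimal sketch, not part of the original app: the constant and function names are hypothetical, and it assumes float32 is 4 bytes as on the Orin.

```cpp
#include <cassert>
#include <cstddef>

// Values copied from the engine dump; all names here are illustrative.
constexpr std::size_t kMinBatch = 1;                        // profile 0 min
constexpr std::size_t kMaxBatch = 16;                       // profile 0 max
constexpr std::size_t kTargetElemsPerFace = 3 * 128 * 128;  // target: (N, 3, 128, 128)
constexpr std::size_t kSourceElemsPerFace = 512;            // source: (N, 512)

// True if a batch of n faces lies inside optimization profile 0.
constexpr bool batch_fits_profile(std::size_t n) {
    return n >= kMinBatch && n <= kMaxBatch;
}

// Bytes needed for the float32 `target` tensor (same size as `output`) for n faces.
inline std::size_t target_bytes(std::size_t n) {
    assert(batch_fits_profile(n));
    return n * kTargetElemsPerFace * sizeof(float);
}

// Bytes needed for the float32 `source` tensor for n faces.
inline std::size_t source_bytes(std::size_t n) {
    assert(batch_fits_profile(n));
    return n * kSourceElemsPerFace * sizeof(float);
}
```

For N = 16 this gives 3,145,728 bytes for `target` and 32,768 bytes for `source`. A check like `batch_fits_profile` may also be useful before filling a preprocess buffer, since a frame with more than 16 faces would fall outside the profile.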

I have verified with a Python script that the model handles multiple faces at once. I want to implement the same behavior in DeepStream.

Issue: I am unable to perform batch inference with the above model. It still runs inference on only one face object at a time.

What I’ve tried:

  • Set batch_size=16 in main_app_config.txt and swap_config.txt, and set network_input_shape=16;3;128;128 in secondary_preprocess.txt.
  • Batched the face objects in preprocessing, assuming the model would then infer on all of them at once.

When I check the model output through outputLayersInfo, its shape is still [3, 128, 128] rather than [N, 3, 128, 128]. So the model still appears to process the face objects one by one.

How can I implement batch inference for the face objects in my pipeline?

Thank you!

You need to verify your model with the TensorRT APIs for the multi-image batch case. Please try writing your own TensorRT sample, or refer to our sample model to check how an ONNX model can generate batched output.

This is not a DeepStream issue.

@Fiona.Chen
Thank you for the quick reply!

I don’t think the issue is with my model. I have verified the engine file as you can see from the following logs:

/usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
[06/04/2025-19:13:30] [I] === Model Options ===
[06/04/2025-19:13:30] [I] Format: *
[06/04/2025-19:13:30] [I] Model: 
[06/04/2025-19:13:30] [I] Output:
[06/04/2025-19:13:30] [I] 
[06/04/2025-19:13:30] [I] === System Options ===
[06/04/2025-19:13:30] [I] Device: 0
[06/04/2025-19:13:30] [I] DLACore: 
[06/04/2025-19:13:30] [I] Plugins:
[06/04/2025-19:13:30] [I] setPluginsToSerialize:
[06/04/2025-19:13:30] [I] dynamicPlugins:
[06/04/2025-19:13:30] [I] ignoreParsedPluginLibs: 0
[06/04/2025-19:13:30] [I] 
[06/04/2025-19:13:30] [I] === Inference Options ===
[06/04/2025-19:13:30] [I] Batch: Explicit
[06/04/2025-19:13:30] [I] Input inference shape : source=16x512
[06/04/2025-19:13:30] [I] Input inference shape : target=16x3x128x128
[06/04/2025-19:13:30] [I] Iterations: 10
[06/04/2025-19:13:30] [I] Duration: 3s (+ 200ms warm up)
[06/04/2025-19:13:30] [I] Sleep time: 0ms
[06/04/2025-19:13:30] [I] Idle time: 0ms
[06/04/2025-19:13:30] [I] Inference Streams: 1
[06/04/2025-19:13:30] [I] ExposeDMA: Disabled
[06/04/2025-19:13:30] [I] Data transfers: Enabled
[06/04/2025-19:13:30] [I] Spin-wait: Disabled
[06/04/2025-19:13:30] [I] Multithreading: Disabled
[06/04/2025-19:13:30] [I] CUDA Graph: Disabled
[06/04/2025-19:13:30] [I] Separate profiling: Disabled
[06/04/2025-19:13:30] [I] Time Deserialize: Disabled
[06/04/2025-19:13:30] [I] Time Refit: Disabled
[06/04/2025-19:13:30] [I] NVTX verbosity: 0
[06/04/2025-19:13:30] [I] Persistent Cache Ratio: 0
[06/04/2025-19:13:30] [I] Optimization Profile Index: 0
[06/04/2025-19:13:30] [I] Weight Streaming Budget: 100.000000%
[06/04/2025-19:13:30] [I] Inputs:
[06/04/2025-19:13:30] [I] Debug Tensor Save Destinations:
[06/04/2025-19:13:30] [I] === Reporting Options ===
[06/04/2025-19:13:30] [I] Verbose: Disabled
[06/04/2025-19:13:30] [I] Averages: 10 inferences
[06/04/2025-19:13:30] [I] Percentiles: 90,95,99
[06/04/2025-19:13:30] [I] Dump refittable layers:Disabled
[06/04/2025-19:13:30] [I] Dump output: Disabled
[06/04/2025-19:13:30] [I] Profile: Disabled
[06/04/2025-19:13:30] [I] Export timing to JSON file: 
[06/04/2025-19:13:30] [I] Export output to JSON file: 
[06/04/2025-19:13:30] [I] Export profile to JSON file: 
[06/04/2025-19:13:30] [I] 
[06/04/2025-19:13:30] [I] === Device Information ===
[06/04/2025-19:13:30] [I] Available Devices: 
[06/04/2025-19:13:30] [I]   Device 0: "Orin" UUID: GPU-8d2a93dd-b960-5cb3-86c0-c70c99cd0a0e
[06/04/2025-19:13:30] [I] Selected Device: Orin
[06/04/2025-19:13:30] [I] Selected Device ID: 0
[06/04/2025-19:13:30] [I] Selected Device UUID: GPU-8d2a93dd-b960-5cb3-86c0-c70c99cd0a0e
[06/04/2025-19:13:30] [I] Compute Capability: 8.7
[06/04/2025-19:13:30] [I] SMs: 16
[06/04/2025-19:13:30] [I] Device Global Memory: 62840 MiB
[06/04/2025-19:13:30] [I] Shared Memory per SM: 164 KiB
[06/04/2025-19:13:30] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/04/2025-19:13:30] [I] Application Compute Clock Rate: 1.3 GHz
[06/04/2025-19:13:30] [I] Application Memory Clock Rate: 1.3 GHz
[06/04/2025-19:13:30] [I] 
[06/04/2025-19:13:30] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[06/04/2025-19:13:30] [I] 
[06/04/2025-19:13:30] [I] TensorRT version: 10.3.0
[06/04/2025-19:13:30] [I] Loading standard plugins
[06/04/2025-19:13:31] [I] [TRT] Loaded engine size: 78 MiB
[06/04/2025-19:13:31] [I] Engine deserialized in 0.0739509 sec.
[06/04/2025-19:13:31] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +289, now: CPU 0, GPU 366 (MiB)
[06/04/2025-19:13:31] [I] Setting persistentCacheLimit to 0 bytes.
[06/04/2025-19:13:31] [I] Set shape of input tensor target to: 16x3x128x128
[06/04/2025-19:13:31] [I] Set shape of input tensor source to: 16x512
[06/04/2025-19:13:31] [I] Created execution context with device memory size: 288.406 MiB
[06/04/2025-19:13:31] [I] Using random values for input target
[06/04/2025-19:13:31] [I] Input binding for target with dimensions 16x3x128x128 is created.
[06/04/2025-19:13:31] [I] Using random values for input source
[06/04/2025-19:13:31] [I] Input binding for source with dimensions 16x512 is created.
[06/04/2025-19:13:31] [I] Output binding for output with dimensions 16x3x128x128 is created.
[06/04/2025-19:13:31] [I] Starting inference
[06/04/2025-19:13:34] [I] Warmup completed 4 queries over 200 ms
[06/04/2025-19:13:34] [I] Timing trace has 54 queries over 3.17112 s
[06/04/2025-19:13:34] [I] 
[06/04/2025-19:13:34] [I] === Trace details ===
[06/04/2025-19:13:34] [I] Trace averages of 10 runs:
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.5567 ms - Host latency: 57.8528 ms (enqueue 0.491925 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.5644 ms - Host latency: 57.8616 ms (enqueue 0.457556 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7458 ms - Host latency: 58.0384 ms (enqueue 0.419116 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7255 ms - Host latency: 58.0242 ms (enqueue 0.458887 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7195 ms - Host latency: 58.0135 ms (enqueue 0.395703 ms)
[06/04/2025-19:13:34] [I] 
[06/04/2025-19:13:34] [I] === Performance summary ===
[06/04/2025-19:13:34] [I] Throughput: 17.0287 qps
[06/04/2025-19:13:34] [I] Latency: min = 57.1222 ms, max = 58.554 ms, mean = 57.9498 ms, median = 57.9761 ms, percentile(90%) = 58.3058 ms, percentile(95%) = 58.437 ms, percentile(99%) = 58.554 ms
[06/04/2025-19:13:34] [I] Enqueue Time: min = 0.386963 ms, max = 0.645996 ms, mean = 0.442129 ms, median = 0.415405 ms, percentile(90%) = 0.527618 ms, percentile(95%) = 0.573486 ms, percentile(99%) = 0.645996 ms
[06/04/2025-19:13:34] [I] H2D Latency: min = 0.131836 ms, max = 0.157959 ms, mean = 0.139919 ms, median = 0.13739 ms, percentile(90%) = 0.151489 ms, percentile(95%) = 0.154602 ms, percentile(99%) = 0.157959 ms
[06/04/2025-19:13:34] [I] GPU Compute Time: min = 56.8344 ms, max = 58.262 ms, mean = 57.6553 ms, median = 57.6754 ms, percentile(90%) = 58.0114 ms, percentile(95%) = 58.1472 ms, percentile(99%) = 58.262 ms
[06/04/2025-19:13:34] [I] D2H Latency: min = 0.0932617 ms, max = 0.158691 ms, mean = 0.154602 ms, median = 0.155518 ms, percentile(90%) = 0.157593 ms, percentile(95%) = 0.158081 ms, percentile(99%) = 0.158691 ms
[06/04/2025-19:13:34] [I] Total Host Walltime: 3.17112 s
[06/04/2025-19:13:34] [I] Total GPU Compute Time: 3.11338 s
[06/04/2025-19:13:34] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/04/2025-19:13:34] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
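
To put the trtexec numbers above in per-face terms, the batch latency and throughput can be divided or multiplied by the batch size. This is only a sketch of that arithmetic with hypothetical function names. It says nothing about DeepStream behavior, since trtexec ran the engine in isolation on random inputs.

```cpp
#include <cassert>

// Average latency attributable to one face when `batch` faces share one inference call.
inline double per_face_latency_ms(double batch_latency_ms, int batch) {
    assert(batch > 0);
    return batch_latency_ms / batch;
}

// Faces processed per second, given engine queries per second at a fixed batch size.
inline double faces_per_second(double qps, int batch) {
    assert(batch > 0);
    return qps * batch;
}
```

With the reported mean latency of 57.9498 ms and throughput of 17.0287 qps at batch 16, this works out to roughly 3.62 ms per face and about 272 faces per second. Running the same trtexec command with batch 1 shapes would give the comparable single-face figure.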

However, when I load this model in deepstream, I get the model’s arch like this:

Opening in BLOCKING MODE 
Setting min object dimensions as 16x16 instead of 1x1 to support VIC compute mode.
INFO: [FullDims Engine Info]: layers num: 3
0   INPUT  kFLOAT target          3x128x128       min: 1x3x128x128     opt: 16x3x128x128    Max: 16x3x128x128    
1   INPUT  kFLOAT source          512             min: 1x512           opt: 16x512          Max: 16x512          
2   OUTPUT kFLOAT output          3x128x128       min: 0               opt: 0               Max: 0               
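
The engine info above lists per-sample dimensions (3x128x128) for each layer, while the min/opt/max columns carry the batch dimension. One possible reading, which this thread does not confirm, is that a per-layer shape of [3, 128, 128] describes one item of a batch rather than proving that the batch size is 1. Under that assumption, a contiguous [N, 3, 128, 128] float buffer can be viewed as N per-face slices, as in this minimal sketch (the names are hypothetical, not DeepStream API):

```cpp
#include <cassert>
#include <cstddef>

// Elements in one [3, 128, 128] face tensor.
constexpr std::size_t kElemsPerFace = 3 * 128 * 128;

// Returns a pointer to the i-th [3, 128, 128] slice inside a contiguous
// [batch_size, 3, 128, 128] float buffer. No data is copied.
inline const float* face_slice(const float* batch_output,
                               std::size_t batch_size,
                               std::size_t i) {
    assert(batch_output != nullptr);
    assert(i < batch_size);
    return batch_output + i * kElemsPerFace;
}
```

Logging the batch size that the inference call actually receives, rather than the per-layer dims, would be a more direct way to check whether faces are processed together.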

The main issue is that I cannot verify whether the model is taking face objects as a batch. It should take all face objects of a single frame at once. It may help to know that I am using custom preprocessing for this model. A sample of the relevant code is provided:

// Collect all faces from the current frame(s) in this batch
  std::vector<FaceBatchData> current_frame_faces;
  
  for (l_frame = batch_meta->frame_meta_list; l_frame != nullptr; l_frame = l_frame->next) {
      NvDsFrameMeta *frame_meta = reinterpret_cast<NvDsFrameMeta *>(l_frame->data);
      NvDsMetaList *l_obj = nullptr;
 
      std::cout << "\n=== Processing Frame " << frame_meta->frame_num << " ===" << std::endl;
      std::cout << "Starting face collection for frame " << frame_meta->frame_num << std::endl;
 
      for (l_obj = frame_meta->obj_meta_list; l_obj != nullptr; l_obj = l_obj->next) {
          NvDsObjectMeta *obj_meta = reinterpret_cast<NvDsObjectMeta *>(l_obj->data);
          
          if (!obj_meta) continue;
          
          std::cout << "  Found object_id: " << obj_meta->object_id << std::endl;
 
          if (obj_meta->base_meta.meta_type == NVDS_OBJ_META && obj_meta->unique_component_id == 1) {
              keypoints.clear();
              guint num_joints = obj_meta->mask_params.size / (sizeof(float) * 2);
 
              for (guint i = 0; i < num_joints; ++i) {
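                  // 640.0f forces floating-point division; if width is an integer, width/640 would truncate the scale
                  gfloat xc = obj_meta->mask_params.data[i * 2] * (width / 640.0f);
                  gfloat yc = obj_meta->mask_params.data[i * 2 + 1] * (width / 640.0f);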
                  keypoints.push_back(cv::Point2f(xc, yc));
              }
            
              if (keypoints.size() == 5) {
                  cv::Mat M, warp_mat;
                  std::tie(M, warp_mat) = norm_crop2(rgb_image, keypoints, 128);
 
                  // Store transformation matrix in object meta
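                  // Copy the raw bits of each double into a gint64 slot with std::memcpy (needs <cstring>);
                  // dereferencing a reinterpret_cast pointer here would violate strict aliasing.
                  // NOTE: check that misc_obj_info has at least 6 slots (MAX_USER_FIELDS in nvdsmeta.h);
                  // writing past the array would corrupt adjacent metadata.
                  static_assert(sizeof(gint64) == sizeof(double), "bit copy needs equal sizes");
                  for (int i = 0; i < 6; ++i) {
                      std::memcpy(&obj_meta->misc_obj_info[i], &M.at<double>(i), sizeof(double));
                  }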
 
                  warp_mat /= 255.0f;
                  
                  // Create batch data for this frame
                  FaceBatchData face_data;
                  face_data.face = warp_mat.clone();
                  face_data.transform_matrix = M.clone();
                  face_data.obj_meta = obj_meta;
                  face_data.frame_num = frame_meta->frame_num;
                  face_data.object_id = obj_meta->object_id;
                  face_data.batch_index = current_frame_faces.size();
                  
                  current_frame_faces.push_back(face_data);
                  
                  std::cout << "  Added face to frame batch. Object ID: " << obj_meta->object_id
                           << ", Frame faces count: " << current_frame_faces.size() << std::endl;
              }
          }
      }
      
      std::cout << "Finished collecting faces for frame " << frame_meta->frame_num
                << ". Total faces collected: " << current_frame_faces.size() << std::endl;
  }
  
  gst_buffer_unmap(inbuf, &in_map_info);
 
  // Process the batch if we have faces in the current frame
  if (current_frame_faces.size() > 0) {
      std::cout << "\n=== Starting Batch Processing ===" << std::endl;
      std::cout << "Total faces to process in batch: " << current_frame_faces.size() << std::endl;
      
      // Process all faces from this frame as a batch
      char* base_ptr = reinterpret_cast<char*>(buf->memory_ptr);
      size_t planar_size = 128 * 128 * 3 * sizeof(float);
      size_t frame_batch_size = current_frame_faces.size();
      
      std::cout << "Creating tensor batch with shape: [" << frame_batch_size << ", 3, 128, 128]" << std::endl;
      std::cout << "Planar size per face: " << planar_size << " bytes" << std::endl;
      
      for (size_t i = 0; i < frame_batch_size; ++i) {
          FaceBatchData& face_data = current_frame_faces[i];
          
          std::cout << "  Processing face " << i << "/" << frame_batch_size
                   << " (Object ID: " << face_data.object_id
                   << " from frame " << face_data.frame_num << ")" << std::endl;
          
          float* pDst = reinterpret_cast<float*>(base_ptr) + i * (128 * 128 * 3);
          // Host staging buffer in planar (CHW) layout; std::vector releases it automatically
          std::vector<float> planar_memory(128 * 128 * 3);
          
          // Convert interleaved pixels (HWC) to planar format (CHW)
          for (int j = 0; j < 128 * 128; j++) {
              const cv::Vec3f& px = face_data.face.at<cv::Vec3f>(j);
              planar_memory[j] = px[0];                  // R
              planar_memory[j + 128 * 128] = px[1];      // G
              planar_memory[j + 2 * 128 * 128] = px[2];  // B
          }
          
          cudaError_t err = cudaMemcpy(pDst, planar_memory.data(), planar_size, cudaMemcpyHostToDevice);
          if (err != cudaSuccess) {
              std::cerr << "Error: cudaMemcpy failed! " << cudaGetErrorString(err) << std::endl;
          } else {
              std::cout << "  Successfully copied face " << i << " to GPU memory" << std::endl;
          }
      }
      
      // Store batch metadata for parser to use
      ctx->current_batch_size = frame_batch_size;
      ctx->batch_object_ids.clear();
      for (const auto& face_data : current_frame_faces) {
          ctx->batch_object_ids.push_back(face_data.object_id);
      }
      
      std::cout << "Stored " << ctx->batch_object_ids.size() << " object IDs for batch" << std::endl;
 
      // Update network input shape with actual batch size
      tensorParam.params.network_input_shape[0] = frame_batch_size;
      
      status = ctx->tensor_impl->syncStream();
      if (status != NVDSPREPROCESS_SUCCESS) {
          std::cerr << "Custom Lib: Cuda Stream Synchronization failed" << std::endl;
          acquirer->release(buf);
          return status;
      }
      
      std::cout << "Successfully processed batch of " << frame_batch_size << " faces" << std::endl;
      return NVDSPREPROCESS_SUCCESS;
  } else {
      // No faces in this frame
      std::cout << "No faces found in current frame, skipping..." << std::endl;
      acquirer->release(buf);
      return NVDSPREPROCESS_TENSOR_NOT_READY;
  }
}
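
The packing step in the code above (interleaved pixels per face, converted to planar layout and placed at offset i in the batch buffer) can also be done for all faces in one host-side pass. The following is a minimal sketch under stated assumptions: it uses plain `std::vector<float>` in place of `cv::Mat`, the function name is hypothetical, and the faces are assumed to be stored back to back in HWC order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packs n_faces interleaved HWC float images (h * w * 3 values each, stored
// back to back) into one contiguous planar NCHW buffer.
inline std::vector<float> pack_hwc_to_nchw(const std::vector<float>& hwc,
                                           std::size_t n_faces,
                                           std::size_t h,
                                           std::size_t w) {
    const std::size_t plane = h * w;  // elements in one channel plane
    assert(hwc.size() == n_faces * plane * 3);

    std::vector<float> nchw(hwc.size());
    for (std::size_t n = 0; n < n_faces; ++n) {
        const float* src = hwc.data() + n * plane * 3;
        float* dst = nchw.data() + n * plane * 3;
        for (std::size_t p = 0; p < plane; ++p) {
            for (std::size_t c = 0; c < 3; ++c) {
                // Pixel p, channel c moves from interleaved to planar position.
                dst[c * plane + p] = src[p * 3 + c];
            }
        }
    }
    return nchw;
}
```

With a buffer built this way, a single `cudaMemcpy` of `nchw.size() * sizeof(float)` bytes could replace the per-face copies. Whether that changes how nvinfer batches the objects is not established in this thread.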

Although the input is batched in preprocessing, inference still runs once per object, which I verified by adding logs to our custom inference function.
There is also no performance increase compared with the model at batch size 1.

Basically, I am facing an implementation issue, since the model works as expected outside of DeepStream. Can you please help me with this? Please let me know if you need any other information.

Thank You!

@Fiona.Chen
Can you please provide any clarifications on this? Thank you!

trtexec only feeds the input tensors with the shapes you set; the output tensor is ignored. From the trtexec result, we cannot tell whether the engine's TensorRT output tensor is what you want.

We have samples in which nvpreprocess works with an SGIE, and we have not found that nvpreprocess causes nvinfer to output objects one by one. Both nvpreprocess and nvinfer are open source, so you can debug your code to find the root cause.