Feature: DeepStream Slice Processing Enhancement

• Issue Type (questions, new requirements, bugs)
new requirements

Opening Discussion & Implementation Initiative

Hi DeepStream community and NVIDIA team! 👋

I'm opening this discussion to propose a significant enhancement to DeepStream that I believe will be of great importance for the entire ecosystem. I'm planning to start implementing this feature and would love to get support and guidance from both the community and NVIDIA to achieve this goal together.

This enhancement has the potential to dramatically improve performance and efficiency across many DeepStream use cases, and I'm excited to collaborate with everyone to make it a reality.

Summary

Request for a new "Slice Processing" feature in DeepStream's nvdspreprocess plugin as an evolution of the current ROI processing capabilities.

Background

Currently, the nvdspreprocess plugin supports ROI (Region of Interest) processing, where each ROI is processed individually, generating separate tensors and requiring multiple inference calls. While this approach is valuable for many use cases, there's an opportunity to enhance performance and model efficiency through a complementary approach.

Proposed Feature: Slice Processing

Naming Consideration

Since "ROI" is already extensively used in DeepStream Analytics modules, I suggest using "Slice" terminology for this new feature to avoid confusion and better represent its functionality.

Core Concept

The Slice Processing feature would allow users to:

  1. Define a target frame resolution for the combined output

  2. Select multiple slice regions from the original frame

  3. Combine slices into a single mosaic frame at the target resolution

  4. Perform single inference on the combined frame instead of multiple separate inferences

Technical Advantages

Performance Benefits

  • Reduced inference overhead: Single inference call instead of multiple calls

  • Better GPU utilization: Larger batch processing on a single tensor

  • Lower memory fragmentation: Single large tensor vs. multiple small tensors

Model Efficiency

  • Enhanced object visibility: Objects appear larger in the combined frame due to cropping and scaling

  • Use smaller/lighter models: Higher effective resolution per object allows for more efficient model architectures

  • Improved detection accuracy: Objects of interest get more pixel representation

Use Case Example

Current ROI Processing:


Original Frame: 1920x1080
ROI 1: 400x400 → Individual inference → Tensor 1
ROI 2: 500x500 → Individual inference → Tensor 2
Total: 2 inference calls

Proposed Slice Processing:

Original Frame: 1920x1080
Slice 1: 400x400 →
Slice 2: 500x500 → Combined into 1280x720 mosaic → Single inference
Total: 1 inference call with higher effective resolution per object

Configuration Proposal

REST API Endpoint

POST /api/v1/slice/update

JSON Schema

```json
{
  "stream": {
    "stream_id": "0",
    "slice_mode": "mosaic",
    "target_resolution": {
      "width": 1280,
      "height": 720
    },
    "slice_count": 2,
    "slices": [
      {
        "slice_id": "0",
        "left": 100,
        "top": 300,
        "width": 400,
        "height": 400,
        "position_in_mosaic": {
          "x": 0,
          "y": 0
        }
      },
      {
        "slice_id": "1",
        "left": 550,
        "top": 300,
        "width": 500,
        "height": 500,
        "position_in_mosaic": {
          "x": 640,
          "y": 0
        }
      }
    ]
  }
}
```
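A minimal Python sketch of how a client could assemble and sanity-check this payload. Note that the `/api/v1/slice/update` endpoint and the schema are part of this proposal and do not exist in DeepStream today; `build_slice_payload` is a hypothetical helper.

```python
import json

def build_slice_payload(stream_id, target_w, target_h, slices):
    # Sanity-check the proposed schema: every slice's placement must
    # lie inside the target mosaic.
    for s in slices:
        pos = s["position_in_mosaic"]
        if not (0 <= pos["x"] < target_w and 0 <= pos["y"] < target_h):
            raise ValueError(f"slice {s['slice_id']} placed outside mosaic")
    return {
        "stream": {
            "stream_id": str(stream_id),
            "slice_mode": "mosaic",
            "target_resolution": {"width": target_w, "height": target_h},
            "slice_count": len(slices),
            "slices": slices,
        }
    }

payload = build_slice_payload(0, 1280, 720, [
    {"slice_id": "0", "left": 100, "top": 300, "width": 400, "height": 400,
     "position_in_mosaic": {"x": 0, "y": 0}},
    {"slice_id": "1", "left": 550, "top": 300, "width": 500, "height": 500,
     "position_in_mosaic": {"x": 640, "y": 0}},
])
body = json.dumps(payload)  # request body for POST /api/v1/slice/update
```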

Configuration File Extension

```ini
[property]
enable-slice-processing=1
slice-target-width=1280
slice-target-height=720
slice-mosaic-mode=1

[slice-group-0]
src-ids=0
slice-count=2
# left;top;width;height;mosaic-x;mosaic-y
slice-0-params=100;300;400;400;0;0
slice-1-params=550;300;500;500;640;0
```
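On the plugin side, the proposed `slice-N-params` values could be parsed with a few lines of Python; the `left;top;width;height;mosaic-x;mosaic-y` ordering follows the JSON example above and is part of this proposal, not an existing DeepStream key.

```python
def parse_slice_params(value):
    # Parse the proposed 'left;top;width;height;mosaic-x;mosaic-y'
    # value from a [slice-group-N] section.
    left, top, width, height, mx, my = (int(v) for v in value.split(";"))
    return {"left": left, "top": top, "width": width, "height": height,
            "mosaic_x": mx, "mosaic_y": my}
```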

Implementation Considerations

Processing Pipeline

  1. Extract slices from original frame based on coordinates

  2. Scale/resize each slice to fit target mosaic layout

  3. Combine slices into single frame buffer

  4. Apply standard preprocessing (normalization, format conversion)

  5. Generate single tensor for inference

  6. Map inference results back to original frame coordinates
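Steps 1-3 above can be sketched on the CPU with NumPy; a real implementation would use CUDA kernels or NvBufSurfTransform, as noted under Memory Management below. The per-slice destination rectangle (`dst_x`, `dst_y`, `dst_w`, `dst_h`) is my assumption, since the proposal leaves each slice's output size in the mosaic implicit.

```python
import numpy as np

def _resize_nn(img, out_h, out_w):
    # Nearest-neighbour resize via index maps; stands in for a CUDA
    # scaling kernel in a real implementation.
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

def build_mosaic(frame, slices, target_h, target_w):
    # Steps 1-3 of the proposed pipeline: extract each slice from the
    # frame, resize it, and paste it at its destination rectangle.
    mosaic = np.zeros((target_h, target_w, frame.shape[2]), frame.dtype)
    for s in slices:
        crop = frame[s["top"]:s["top"] + s["height"],
                     s["left"]:s["left"] + s["width"]]
        scaled = _resize_nn(crop, s["dst_h"], s["dst_w"])
        mosaic[s["dst_y"]:s["dst_y"] + s["dst_h"],
               s["dst_x"]:s["dst_x"] + s["dst_w"]] = scaled
    return mosaic
```

Normalization and format conversion (step 4) would then run on the mosaic exactly as they do today on a single frame.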

Memory Management

  • Efficient slice extraction using CUDA kernels

  • Optimized memory copying for mosaic combination

  • Support for different memory types (device, pinned, unified)

Backward Compatibility

  • Maintain existing ROI functionality unchanged

  • Add slice processing as an additional feature

  • Allow per-stream configuration of ROI vs. Slice mode

Benefits Summary

  1. Performance: Reduced inference calls and better GPU utilization

  2. Efficiency: Use lighter models with better object representation

  3. Flexibility: Choose between ROI (multiple inferences) or Slice (single inference) based on use case

  4. Scalability: Better resource management for multi-stream scenarios

  5. Accuracy: Enhanced object detection through improved pixel density

Request for Community Feedback

This enhancement would significantly benefit applications requiring:

  • High-performance multi-region analysis

  • Resource-constrained deployments

  • Real-time processing of large-resolution streams

  • Efficient object detection in specific areas of interest

Would love to hear thoughts from the community and NVIDIA DeepStream team on the feasibility and potential implementation of this feature.

Ref:

What does this mean? The number of inference calls is decided by the model's input batch size and the number of ROIs; if the model's input batch size is big enough, there is only one tensor and a single inference.

As I understand it, you want to implement SAHI with nvdspreprocess + nvinfer. That is a customized application, so you can do it your own way. The nvdspreprocess plugin is a template plugin; the current /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvdspreprocess/nvdspreprocess_lib is more a sample than a fixed implementation, and it is fully open source, so you can customize any function for your implementation.

Thank you for sharing your good idea!

What does this mean? The number of inference calls is decided by the model's input batch size and the number of ROIs; if the model's input batch size is big enough, there is only one tensor and a single inference.

I may be wrong but this is how I understood it to work.

ROI Processing

Input: 1 source with 3 ROIs
Configuration: max_batch_size = 1

Processing Flow:
├── ROI 1: 400x400 → Individual inference → Tensor 1 [1, 400, 400, 3]
├── ROI 2: 500x500 → Individual inference → Tensor 2 [1, 500, 500, 3]
└── ROI 3: 300x300 → Individual inference → Tensor 3 [1, 300, 300, 3]

Result: 3 tensors, 3 inferences

Slice Processing

Input: 1 source with 3 ROIs
Configuration: mosaic_mode = true, target_resolution = 1280x720

Processing Flow:
├── ROI 1: 400x400 → Slice 1 (scaled to fit)
├── ROI 2: 500x500 → Slice 2 (scaled to fit) → Combined into 1280x720 mosaic
└── ROI 3: 300x300 → Slice 3 (scaled to fit)

Result: 1 tensor [1, 720, 1280, 3], 1 inference

Multi-Camera Scaling

ROI Processing

Configuration: 4 sources, 3 ROIs each, max_batch_size = 1
Total Inferences: 4 cameras × 3 ROIs = 12 inferences
Memory Usage: 12 × individual_tensor_size

Slice Processing

Configuration: 4 sources, 3 ROIs each, mosaic_mode = true
Total Inferences: 4 cameras × 1 mosaic = 4 inferences
Memory Usage: 4 × mosaic_tensor_size
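The scaling arithmetic above fits in a small back-of-envelope helper; `inference_calls` is a hypothetical name. Note that a large enough ROI batch size also collapses ROI mode into fewer calls, which is the point raised in the quoted reply.

```python
import math

def inference_calls(cameras, rois_per_camera, mode, max_batch_size=1):
    # Back-of-envelope count of inference calls per frame interval.
    if mode == "mosaic":
        return cameras  # one combined tensor per camera
    # ROI mode: ROIs are batched up to the model's max batch size.
    return cameras * math.ceil(rois_per_camera / max_batch_size)
```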

Examples - Inference Batch Size 2

Single Camera, Multiple ROIs

Configuration:
- 1 Camera (Source ID = 0)
- 4 ROIs defined
- max_batch_size = 2

Processing:
- Source 0: ROI 1,2 → Batch 1 → Tensor [2, H, W, C]
- Source 0: ROI 3,4 → Batch 2 → Tensor [2, H, W, C]
- Total: 2 batches, 2 tensors

Single Camera, Multiple ROIs (Slice)

Configuration:
- 1 Camera (Source ID = 0)
- 4 ROIs defined
- mosaic_mode = true, target_resolution = 1280x720

Processing:
- Source 0: ROI 1,2,3,4 → Mosaic → Tensor [1, 720, 1280, C]
- Total: 1 batch, 1 tensor

Modern computer vision models are commonly trained using mosaic augmentation techniques, which provide a significant advantage when applying mosaic-style processing at inference time.

One of the key technical challenges with mosaic processing is coordinate mapping - converting detection coordinates from the mosaic frame back to the original image coordinates.
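That mapping can be sketched as a simple affine transform per slice. This is a sketch under my assumptions: the slice dict uses the hypothetical `dst_*` keys for its rectangle in the mosaic, and the box is assumed to lie entirely within one slice; boxes straddling slice borders need extra handling (clipping or rejection).

```python
def mosaic_to_frame(box, s):
    # Map a detection (x1, y1, x2, y2) in mosaic coordinates back into
    # the original frame, given the slice dict 's' it falls inside.
    x1, y1, x2, y2 = box
    fx = lambda x: s["left"] + (x - s["dst_x"]) * s["width"] / s["dst_w"]
    fy = lambda y: s["top"] + (y - s["dst_y"]) * s["height"] / s["dst_h"]
    return (fx(x1), fy(y1), fx(x2), fy(y2))
```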

This enhancement would provide users with the flexibility to choose the most appropriate processing method for their specific use case, while maintaining the excellent existing ROI processing capabilities. "Mosaic" mode would be particularly valuable for complex multi-camera deployments where traditional ROI processing would consume excessive resources.

The goal is to give users more options to optimize their pipelines based on their specific requirements, not to replace the proven ROI processing functionality.

Note: This is not a formal feature request but rather an exploration of possibilities to improve the gst-nvdspreprocess plugin’s capabilities. The goal is to:

  • Share ideas about how to address scalability challenges in multi-camera, multi-ROI scenarios

  • Discuss potential solutions that could complement existing ROI processing

  • Explore possibilities for making DeepStream more efficient and flexible

  • Gather community feedback on whether such enhancements would be valuable

@Fiona.Chen Thank you!

If you use the nvdspreprocess ROI feature, it does not work this way; gst-nvinfer does not support dynamic resolutions. Please refer to the source code. What you want to improve is more an application-level strategy than a plugin-level one. The plugins are open source, so you can customize them according to your requirements.