Real-time Multi-Network Inference Optimization and Hardware Scaling for Industrial Computer Vision Application

Hello everyone,

We are currently developing a real-time Computer Vision system at Ajinomoto, focused on guided operations in an industrial environment. The application consists of multiple custom neural networks running in parallel, each responsible for validating distinct steps within structured operational flows (called “conduct groups”).

Each group is composed of several stages like object detection, product transfer validation, wrapping confirmation, and more — all executed and validated live using video streams. We heavily rely on Docker containers, PyTorch/TensorFlow models, and GPU inference for continuous processing.

📌 Technical Stack:

  • Programming Language: Python
  • Frameworks: PyTorch, TensorFlow (hybrid usage)
  • Containers: Multiple Docker containers (one per neural network pipeline)
  • Real-time video processing with continuous GPU inference

⚙️ Current Hardware:

  • CPU: Intel64 Family 6 Model 183
  • RAM: 32 GB
  • GPU: NVIDIA GeForce RTX 4060
  • OS: Windows 10
  • Deployment: On-premises industrial desktop

📈 Performance Observations:

| Phase | GPU Usage | RAM Usage |
|---|---|---|
| Standby | 4–22% | 49% |
| Process Start | Peaks at 90% | 52% |
| Mid-Process Load | 63–66% | 53% |

CPU usage stays at 12–19% throughout.

The GPU is clearly the bottleneck, especially as we scale with more neural networks and additional logic (PLCs, APIs, parallel inspections).
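To confirm the bottleneck as more networks are added, it helps to log utilization continuously rather than spot-checking. A minimal stdlib-only sketch that parses the CSV output format of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` (parsing is demonstrated on a captured sample string so the sketch runs anywhere; in production the lines would come from `subprocess.check_output`):

```python
# Parse utilization samples in the CSV format produced by:
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
#       --format=csv,noheader,nounits
# In production these lines would come from subprocess.check_output();
# here a captured sample is parsed so the sketch runs without a GPU.

def parse_gpu_samples(csv_text: str) -> list:
    """Turn nvidia-smi CSV lines into a list of dicts."""
    samples = []
    for line in csv_text.strip().splitlines():
        timestamp, util, mem = (field.strip() for field in line.split(","))
        samples.append({
            "timestamp": timestamp,
            "gpu_util_pct": int(util),
            "mem_used_mib": int(mem),
        })
    return samples

# Hypothetical sample output, not real measurements from this system.
sample_output = """\
2024/01/15 10:00:01.000, 22, 3100
2024/01/15 10:00:02.000, 90, 5400
2024/01/15 10:00:03.000, 64, 5350
"""

samples = parse_gpu_samples(sample_output)
peak = max(s["gpu_util_pct"] for s in samples)
print(f"{len(samples)} samples, peak GPU utilization {peak}%")
```

Logging this alongside per-container timestamps makes it easier to attribute the 90% start-up spikes to a specific pipeline.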


❓ Our Questions:

  1. Optimization: How can we better optimize GPU usage? Should we integrate NVIDIA SDKs like TensorRT, DeepStream, or CUDA directly into our pipelines?
  2. Hardware Recommendation: What would be the ideal hardware setup as we scale? Should we move to RTX Professional, A-Series, or even Jetson devices for decentralized processing nodes?
  3. Architecture: Any suggestions for architectural improvements when dealing with parallel inference pipelines?
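On question 3, one common pattern when several pipelines share a single GPU is to funnel frames from all streams into one inference worker that micro-batches them, instead of letting each container compete for the device independently. A stdlib-only sketch with a stubbed model (`fake_infer` and the queue layout are illustrative assumptions, not details from the original post):

```python
import queue
import threading

# Stub standing in for a real batched model call (e.g. PyTorch/TensorRT).
def fake_infer(batch):
    return [f"result-for-{frame}" for frame in batch]

def inference_worker(frames: queue.Queue, results: dict, max_batch: int = 4):
    """Drain frames from all streams and run them as micro-batches."""
    while True:
        item = frames.get()
        if item is None:              # poison pill: shut down
            break
        batch = [item]
        # Opportunistically grow the batch without blocking.
        while len(batch) < max_batch:
            try:
                nxt = frames.get_nowait()
            except queue.Empty:
                break
            if nxt is None:
                frames.put(None)      # re-queue the pill, flush this batch
                break
            batch.append(nxt)
        outputs = fake_infer([frame for _, frame in batch])
        for (stream_id, frame), out in zip(batch, outputs):
            results[(stream_id, frame)] = out

frames: queue.Queue = queue.Queue()
results: dict = {}
worker = threading.Thread(target=inference_worker, args=(frames, results))
worker.start()

# Three "camera streams" each submit frames to the shared worker.
for stream_id in range(3):
    for frame in range(5):
        frames.put((stream_id, f"s{stream_id}-f{frame}"))
frames.put(None)
worker.join()
print(f"processed {len(results)} frames")
```

Batching amortizes per-call GPU overhead across streams; the same idea is what inference servers such as Triton implement with dynamic batching.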

Any recommendations or pointers to similar use cases would be greatly appreciated. Our goal is to build a scalable, robust solution aligned with NVIDIA’s ecosystem both in hardware and software.

Thank you!

For the neural-network part, TensorRT will be more efficient than PyTorch. If you need other functions in your app, such as video decoding or on-screen display (OSD, e.g. drawing bounding boxes and labels), the DeepStream framework may help you implement the whole app with all the necessary hardware acceleration and eliminate unnecessary memory copies throughout the pipeline.
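Before committing to a conversion, it is worth measuring the speedup on your own models and inputs. A minimal stdlib timing harness; the two inference functions here are placeholders (simulated with `time.sleep`), to be replaced by a real PyTorch forward pass and a TensorRT execution-context call on the same input:

```python
import time

def benchmark(infer, n_iters: int = 50, warmup: int = 5) -> float:
    """Return mean latency in milliseconds for a zero-arg inference callable."""
    for _ in range(warmup):          # warm-up runs excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    return (time.perf_counter() - start) / n_iters * 1000.0

# Placeholders: swap in the real calls, e.g. a torch.no_grad() forward
# pass versus a TensorRT engine invocation. The sleeps only simulate work.
def pytorch_like_infer():
    time.sleep(0.002)                # pretend: ~2 ms per frame

def tensorrt_like_infer():
    time.sleep(0.001)                # pretend: ~1 ms per frame

baseline = benchmark(pytorch_like_infer)
optimized = benchmark(tensorrt_like_infer)
print(f"baseline {baseline:.2f} ms, optimized {optimized:.2f} ms")
```

Warm-up iterations matter especially on GPU, where the first calls pay one-time kernel-compilation and allocation costs.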

It depends on the workload and your scenario.

What do you mean by “parallel”? What exactly do you want to parallelize?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.