Full NVIDIA CUDA + TensorRT Stack Works, but Production Deployment Remains Unclear

I recently built and ran my full model pipeline using the complete NVIDIA stack and SDK:

  • cuda-repo-wsl-ubuntu-13-2-local_13.2.1-1_amd64.deb (3.29 GB)

  • cudnn-local-repo-ubuntu2404-9.21.1_1.0-1_amd64.deb (1.9 GB)

  • nv-tensorrt-local-repo-ubuntu2404-10.16.1-cuda-13.2_1.0-1_amd64.deb (6.9 GB)

Everything works correctly — training, conversion, and inference setup are all validated.

Now I’m moving toward deployment, and that’s where things become less straightforward.

When exploring container options, I keep running into the same issue: most available images are extremely heavy (about 6 GB compressed, and easily 10–20 GB once dependencies are added). The alternatives are to start from a CUDA base image (~6 GB) or to use the TensorRT lean packages.

However, the “lean” TensorRT runtime does not support every operation my model requires, which makes deployment unreliable. This raises a practical question:
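For anyone who wants to reproduce the check, here is a minimal sketch of testing which runtime can actually deserialize a given engine. It assumes the `tensorrt_lean` and `tensorrt` Python packages expose the same `Logger` / `Runtime` / `deserialize_cuda_engine` interface, and that a failed load either returns `None` or raises. The function name and the `importer` parameter are hypothetical, added only so the logic can be exercised without a GPU.

```python
import importlib

# Candidate runtimes, smallest first. Whether either package is installed
# in a given image is an assumption; missing ones are skipped.
CANDIDATES = ("tensorrt_lean", "tensorrt")


def first_runtime_that_loads(engine_bytes, candidates=CANDIDATES,
                             importer=importlib.import_module):
    """Return the name of the first runtime that deserializes the engine, else None."""
    for name in candidates:
        try:
            trt = importer(name)
        except ImportError:
            continue  # this runtime is not installed here
        try:
            logger = trt.Logger(trt.Logger.WARNING)
            runtime = trt.Runtime(logger)
            engine = runtime.deserialize_cuda_engine(engine_bytes)
        except Exception:
            engine = None  # treat any load error as "this runtime cannot load it"
        if engine is not None:
            return name
    return None
```

Calling it with the bytes of a serialized engine file (for example a hypothetical `model.engine`) gives `"tensorrt_lean"` if the lean runtime is enough, `"tensorrt"` if only the full runtime works, and `None` if neither loads it.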

What is the intended production path when a model relies on ops that the lean runtime does not support, so that a full devel image (~13 GB+ just for inference) becomes the only stable option?

At the moment, the only reliable solution seems to be using the full TensorRT development container just to run inference.

I’m trying to understand the best practices here — especially how production deployments are typically optimized without sacrificing operator support or ballooning container size.
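One way to make the size question concrete is to measure which shared libraries the inference process actually maps at runtime, and how much disk they take, compared with everything the devel image ships. This is a rough sketch that assumes a Linux container where `/proc/<pid>/maps` is readable. The function names are hypothetical, and it is a diagnostic rather than a recommended deployment method.

```python
import os


def shared_libs_from_maps(maps_text):
    """Extract the unique shared-library paths from /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        parts = line.split(maxsplit=5)
        # The sixth field, when present, is the path of the mapped file.
        if len(parts) == 6:
            path = parts[5].strip()
            name = os.path.basename(path)
            if path.startswith("/") and (name.endswith(".so") or ".so." in name):
                libs.add(path)
    return sorted(libs)


def total_size_gb(paths):
    """Sum the on-disk size of the paths that exist, in GB (10**9 bytes)."""
    return sum(os.path.getsize(p) for p in paths if os.path.exists(p)) / 1e9
```

Reading `/proc/self/maps` from inside the inference process after one warm-up request, then passing the text to `shared_libs_from_maps` and the result to `total_size_gb`, gives a lower bound on what a slimmer runtime image would need to contain.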

Would love insights from others building real-world inference systems on NVIDIA stacks.