TensorRT-LLM vs. Standard NIM for Production Alert Triage with AI Agents

Hi NVIDIA community,

I’m architecting an alert triage and incident response agentic AI solution for NVIDIA’s
Global Testing Laboratory infrastructure (managing SaturnV, Selene, and
GPU services). The system uses LangGraph to orchestrate a multi-agent
workflow that investigates and triages infrastructure alerts with the goal
of reducing MTTR by 70-80%.

With a LangChain + LangGraph agentic workflow, the agent will handle alert classification and evidence gathering, with potential remediation gated by a human in the loop, to reduce the classification overhead on SREs.
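A rough sketch of that flow in plain Python, to make the stages concrete. It deliberately uses no LangGraph or LangChain calls; every name here (`Alert`, `TriageState`, `run_triage`, the node functions) is hypothetical, and the keyword rule in `classify` only stands in for where an LLM call would go.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    source: str   # hypothetical: host or service that raised the alert
    message: str


@dataclass
class TriageState:
    alert: Alert
    severity: str = "unknown"
    evidence: List[str] = field(default_factory=list)
    proposed_action: str = ""
    approved: bool = False


def classify(state: TriageState) -> TriageState:
    # Placeholder keyword rule; an LLM-backed classifier would replace this.
    text = state.alert.message.lower()
    state.severity = "critical" if ("down" in text or "ecc" in text) else "low"
    return state


def gather_evidence(state: TriageState) -> TriageState:
    # Placeholder; real evidence would come from metrics/logging queries.
    state.evidence.append(f"recent logs for {state.alert.source}")
    return state


def propose_remediation(state: TriageState) -> TriageState:
    # Only critical alerts get a proposed action in this sketch.
    if state.severity == "critical":
        state.proposed_action = f"drain node {state.alert.source}"
    return state


def human_gate(state: TriageState,
               approve: Callable[[TriageState], bool]) -> TriageState:
    # Nothing is marked approved unless a human callback says so.
    if state.proposed_action:
        state.approved = approve(state)
    return state


def run_triage(alert: Alert,
               approve: Callable[[TriageState], bool]) -> TriageState:
    state = TriageState(alert=alert)
    for step in (classify, gather_evidence, propose_remediation):
        state = step(state)
    return human_gate(state, approve)
```

In an orchestration framework each function would presumably become a graph node, with the `approve` callback replaced by whatever pause/resume mechanism the framework offers for human review.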

I'd like to know:

  • Have others built agentic systems with TensorRT-LLM?
  • Are there any open source examples of multi-agent pipelines?
  • Any thoughts on using NVIDIA NIM for this simple use case instead?

This is more of a DevOps/infra-level question, because metrics, monitoring, and logging data will be ingested into this system to