[Jetson Orin NX | JetPack 6.2 | Metropolis Stack] - Prometheus target error on emdat-analytics (port 6000) + RedisTS issues after RTSP disconnections

System Setup

  • Hardware Platform: Jetson Orin NX 16GB/8GB
  • JetPack Version: 6.2 (L4T 36.4.3)
  • Metropolis Components:
    • ai_nvr 2.0.1 (jps_v1.2.9)
    • nvidia-jetson-services/stable 2.0.0
  • Monitoring Stack: Prometheus + Grafana

Relevant Prometheus Configuration

We have a valid and working prometheus.yaml file with a specific job defined for the emdat-analytics service:

- job_name: 'emdat-analytics'
  scrape_interval: 10s
  static_configs:
    - targets: ['localhost:6000']
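
As a reference point, the scrape state of this job can be checked against the Prometheus HTTP API. This is a minimal check, assuming Prometheus listens on its default port 9090 and jq is installed (both are assumptions on our side, adjust as needed):

# Show current health and last error for the emdat-analytics target
# (assumes Prometheus on localhost:9090 and jq available)
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[]
        | select(.labels.job == "emdat-analytics")
        | {scrapeUrl, health, lastError, lastScrape}'

When the problem occurs, health flips to "down" and lastError shows the same connection-refused message reported below.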

Issue Description

We are experiencing intermittent failures in one of our Prometheus targets — specifically, the emdat-analytics endpoint on localhost:6000.

Prometheus shows the following error:

Error scraping target:
Get "http://localhost:6000/metrics": dial tcp 127.0.0.1:6000: connect: connection refused

  • The endpoint also fails locally on the host (curl http://localhost:6000/metrics).
  • After some time the port appears closed and no process is listening on it (see the quick check after this list).
  • We have confirmed that the target is correctly defined in prometheus.yaml (not autogenerated).
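
When the scrape starts failing, a quick check like the following (a sketch; the container name at the end is illustrative and depends on the Compose project naming) confirms whether anything is still listening on port 6000 and which container, if any, publishes it:

# Is anything still listening on TCP port 6000 on the host?
sudo ss -ltnp | grep ':6000' || echo "nothing listening on :6000"

# Which container (if any) publishes port 6000, and is it still running?
sudo docker ps -a --format '{{.Names}}\t{{.Status}}\t{{.Ports}}' | grep -i 6000

# Check the suspected exporter container directly (name is illustrative)
sudo docker port emdx-analytics-01 2>/dev/null || echo "no port mappings found"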


Runtime Observations

This behavior appears to be associated with the emdx-analytics module, likely involving Redis TimeSeries and unstable streaming input:

  • We are using RTSP streams with DeepStream.
  • RTSP cameras intermittently disconnect (likely due to network or source instability).
  • After disconnection:
    • DeepStream doesn’t reconnect.
    • The SDR service does not re-aggregate the sensor streams.
    • Visual elements such as ROIs and Tripwires are visible in the VST, but their counters remain stuck at 0.
  • In some cases, restarting services manually restores functionality:
    • Restarting sdr, sdr-emdx, and emdx-analytics-01/02 is sometimes enough.
    • Other times, we must restart the entire Docker Compose stack (a sketch of the checks we run before restarting follows this list).
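
Before restarting anything, this is roughly what we capture after a disconnection so the logs can be correlated with the frozen counters. The container names are illustrative, taken from our Compose stack, and may differ in other deployments:

# Snapshot the status of the analytics pipeline containers
sudo docker ps -a --format '{{.Names}}\t{{.Status}}' | grep -Ei 'sdr|emdx|deepstream'

# Recent logs around the disconnection window (container names are illustrative)
for c in sdr sdr-emdx emdx-analytics-01 emdx-analytics-02; do
  echo "===== $c ====="
  sudo docker logs --since 30m "$c" 2>&1 | tail -n 50
done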

Technical Questions

  1. Which service or component is responsible for exposing the /metrics endpoint on port 6000?
     • We suspect it is part of emdx-analytics, but need confirmation.
     • Is there a specific binary, script, or container responsible?
  2. Why does the /metrics endpoint become unavailable over time?
     • Are there any known issues with RedisTS or the metric exporters under unstable input conditions?
     • What logs or traces can we inspect to better understand the root cause?
  3. How can we automate service recovery when this happens?
     • We are considering Grafana Alerts + Webhooks to restart services automatically (a minimal watchdog sketch follows this list).
       • Is this a valid approach in the context of Metropolis?
       • Would restarting specific containers (instead of a full docker-compose down && up) be safe?
     • Are there any NVIDIA-recommended tools or patterns for high-availability service recovery?
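
To make question 3 concrete, the workaround we have in mind is a small watchdog that probes the /metrics endpoint and restarts only the suspect containers when it stops responding. This is only a sketch under the assumption that the exporter lives in the emdx-analytics containers; the names are illustrative, and we would prefer an NVIDIA-recommended pattern over this:

#!/usr/bin/env bash
# Watchdog sketch: probe the metrics endpoint and restart the suspected
# containers when it stops responding. Container names are illustrative
# and depend on the Compose project naming.
set -u

METRICS_URL="http://localhost:6000/metrics"
CONTAINERS=(sdr sdr-emdx emdx-analytics-01 emdx-analytics-02)

while true; do
  if ! curl -sf --max-time 5 "$METRICS_URL" > /dev/null; then
    echo "$(date -Is) metrics endpoint down, restarting containers" >&2
    for c in "${CONTAINERS[@]}"; do
      sudo docker restart "$c" || echo "failed to restart $c" >&2
    done
    sleep 60   # give services time to come back before probing again
  fi
  sleep 10
done

A Grafana alert webhook could trigger the same restart logic instead of a polling loop; whether such targeted restarts are safe for the Metropolis services is exactly what we hope to confirm here.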

Available Artifacts

We can provide:

  • The full prometheus.yaml
  • Docker Compose setup and logs (docker logs, journalctl, etc.)
  • Screenshots from Prometheus and Grafana dashboards
  • Any container logs relevant to emdx-analytics, sdr, and DeepStream

Summary

We’re trying to:

  • Identify why the emdat-analytics Prometheus target becomes unavailable.
  • Understand the underlying failure mode of the /metrics endpoint on port 6000.
  • Implement an automated recovery strategy using monitoring alerts.

We would appreciate any guidance from the NVIDIA team or community users who have experienced similar behavior.

Thanks for your report. I will check and provide feedback later.