System Setup
- Hardware Platform: Jetson Orin NX 16GB/8GB
- JetPack Version: 6.2 (L4T 36.4.3)
- Metropolis Components:
  - ai_nvr 2.0.1 (jps_v1.2.9)
  - nvidia-jetson-services/stable 2.0.0
- Monitoring Stack: Prometheus + Grafana
Relevant Prometheus Configuration
We have a valid and working `prometheus.yaml` file with a specific job defined for the emdat-analytics service:

```yaml
- job_name: 'emdat-analytics'
  scrape_interval: 10s
  static_configs:
    - targets: ['localhost:6000']
```
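For completeness, these are the checks we run on the Prometheus side to confirm the config parses and the target is registered (assuming the Prometheus API is reachable on its default port 9090; the config path is a placeholder):

```bash
# Validate the configuration file (path is a placeholder for our actual file).
promtool check config /path/to/prometheus.yaml

# Ask Prometheus which targets it sees and their current health
# (assumes the Prometheus API is on the default port 9090).
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
```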
Issue Description
We are experiencing intermittent failures on one of our Prometheus targets, specifically the emdat-analytics endpoint on `localhost:6000`.
Prometheus shows the following error when scraping the target:

```
Get "http://localhost:6000/metrics": dial tcp 127.0.0.1:6000: connect: connection refused
```
- The endpoint also fails locally on the host (`curl http://localhost:6000/metrics`).
- After some time, the port appears closed and no process is listening on it (our triage commands are sketched after this list).
- We have confirmed that the target is correctly defined in `prometheus.yaml` (not autogenerated).
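For reference, this is roughly the triage we run on the device when the scrape starts failing; the container names are guesses based on our compose service names and may not match exactly:

```bash
# Is anything still listening on port 6000?
ss -ltnp | grep ':6000'

# Which container (if any) publishes port 6000?
docker ps --filter "publish=6000"

# Are the analytics containers still up? (names assumed from our compose services)
docker ps -a --filter "name=emdx"

# Recent logs from the suspected service (container name is an assumption).
docker logs --tail 200 emdx-analytics-01
```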
Runtime Observations
This behavior seems to be associated with the `emdx-analytics` module, likely involving Redis TimeSeries and streaming conditions:
- We are using RTSP streams with DeepStream.
- RTSP cameras intermittently disconnect (likely due to network or source instability).
- After disconnection:
  - DeepStream doesn’t reconnect.
  - The SDR service does not re-aggregate the sensor streams.
  - Visual elements such as ROIs and Tripwires are visible in the VST, but their counters remain stuck at 0.
- In some cases, restarting services manually restores functionality (a sketch of the selective restart is shown after this list):
  - Restarting `sdr`, `sdr-emdx`, and `emdx-analytics-01/02` is sometimes enough.
  - Other times, we must restart the entire Docker Compose stack.
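The selective restart mentioned above looks roughly like this; service names are taken from our docker-compose file, and we run it from the directory containing the compose file:

```bash
# Restart only the services that appear to be stuck (names from our compose file).
docker compose restart sdr sdr-emdx emdx-analytics-01 emdx-analytics-02

# When that is not enough, we recycle the whole stack.
docker compose down && docker compose up -d
```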
Technical Questions
- Which service or component is responsible for exposing the `/metrics` endpoint on port 6000?
  - We suspect it is part of `emdx-analytics`, but need confirmation.
  - Is there a specific binary, script, or container responsible?
- Why does the `/metrics` endpoint become unavailable over time?
  - Are there any known issues with `RedisTS` or the metric exporters under unstable input conditions?
  - What logs or traces can we inspect to better understand the root cause?
- How can we automate service recovery when this happens?
  - We are considering using Grafana Alerts + Webhooks to restart services automatically; a rough sketch of the interim watchdog we have in mind follows this list.
  - Is this a valid approach in the context of Metropolis?
  - Would restarting specific containers (instead of a full `docker-compose down && up`) be safe?
- Are there any NVIDIA-recommended tools or patterns for high-availability service recovery?
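To make the automation question concrete, below is the kind of interim watchdog we have in mind while we evaluate the Grafana Alerts + Webhooks route. The compose directory, service names, and timeout are assumptions from our environment, and we would drop this in favor of any NVIDIA-recommended pattern:

```bash
#!/usr/bin/env bash
# Watchdog sketch: poll the metrics endpoint and restart the analytics services
# if it becomes unreachable. Intended to run from cron or a systemd timer.
set -euo pipefail

COMPOSE_DIR=/opt/ai_nvr                      # placeholder: location of our compose file
METRICS_URL=http://localhost:6000/metrics

if ! curl -sf --max-time 5 "$METRICS_URL" > /dev/null; then
    echo "$(date -Is) metrics endpoint unreachable, restarting analytics services" >&2
    cd "$COMPOSE_DIR"
    # Try the lightweight restart first; fall back to recycling the full stack.
    docker compose restart sdr sdr-emdx emdx-analytics-01 emdx-analytics-02 \
        || { docker compose down && docker compose up -d; }
fi
```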
Available Artifacts
We can provide:
- The full `prometheus.yaml`
- Docker Compose setup and logs (`docker logs`, `journalctl`, etc.); collection commands are sketched after this list
- Screenshots from Prometheus and Grafana dashboards
- Any container logs relevant to `emdx-analytics`, `sdr`, and DeepStream
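If it helps, this is how we would collect those logs (service names are assumptions from our compose file; the time window is arbitrary):

```bash
# Logs from the analytics-related services over the last 24 hours.
docker compose logs --since 24h sdr sdr-emdx emdx-analytics-01 emdx-analytics-02 > emdx_stack.log

# Full system journal for the current boot.
journalctl -b --no-pager > journal.log
```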
Summary
We’re trying to:
- Identify why the `emdat-analytics` Prometheus target becomes unavailable.
- Understand the underlying failure mode of the `/metrics` endpoint on port 6000.
- Implement an automated recovery strategy using monitoring alerts.
We would appreciate any guidance from the NVIDIA team or community users who have experienced similar behavior.