Could anyone with the appropriate hardware test that nv-monitor is correctly exporting RDMA metrics, please? I don’t have access to the hardware myself. Binaries are built by the CI, so it should be quick to dump the output to CSV or to your OTel dashboard.
If you need any opt-in UI enhancements, raise them with me and I can try to implement them blind. Or let me loose on your infra and I’ll test it myself 😊
Beautiful project. As Claude mentioned, I’ve added this because there was an issue with commas and Prometheus.
Done. Summary of what was documented and fixed:
Fix applied
Root cause: setlocale(LC_ALL, "") in main() was inheriting es_ES.UTF-8 from the system, causing snprintf("%.2f") to produce 1,00 (comma) instead of 1.00 (period), which is an invalid format for Prometheus.
Fix in code (nv-monitor.c:1757):
setlocale(LC_ALL, "");
setlocale(LC_NUMERIC, "C"); /* Force decimal point for Prometheus exposition format */
This forces a decimal point for floats without breaking Unicode/ncursesw.
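For anyone who wants to reproduce the behaviour outside nv-monitor, here is a minimal sketch of the same two-call pattern. The helper names `init_locale` and `format_value` are hypothetical and are not taken from nv-monitor.c; the sketch only assumes the standard `setlocale` and `snprintf` calls quoted above.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from nv-monitor.c): adopt the user's locale,
 * then pin only the numeric category to "C". */
static void init_locale(void)
{
    setlocale(LC_ALL, "");      /* inherit the environment's locale (UTF-8, ncursesw) */
    setlocale(LC_NUMERIC, "C"); /* but always use '.' as the decimal point */
}

/* Hypothetical helper: format one sample with two decimals,
 * as a metrics line would need it. */
static int format_value(char *buf, size_t len, double value)
{
    return snprintf(buf, len, "%.2f", value);
}
```

Calling `init_locale()` once at startup and then `format_value(buf, sizeof buf, 1.0)` should give a period as the decimal separator even when the environment sets a comma-decimal locale such as es_ES.UTF-8, because the second `setlocale` call overrides only the LC_NUMERIC category.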
Thank you! We had fixed the issue based on your earlier report but the PR was still valuable as I did not update the documentation!! Thank you for doing that!
That would be wonderful. There are binaries built by GitHub’s CI; if you’re OK to download those, it will save having to build it.
The steps below assume a Spark cluster (Arm64). On x86_64 systems, use nv-monitor-linux-amd64 instead.
# Check there are InfiniBand devices available - if not, then quit the test
ls /sys/class/infiniband/
# download the binary to ./nv-monitor and make it executable
curl -L -o nv-monitor https://github.com/wentbackward/nv-monitor/releases/download/v1.6.1/nv-monitor-linux-arm64
chmod +x nv-monitor
# run the monitor for 30s at 2s intervals
timeout 30 ./nv-monitor -n -l rdma-test.csv -i 2000
# check if rdma output made it into the log
grep -n rdma rdma-test.csv
# clean up
rm -f nv-monitor rdma-test.csv
All I need to check is that you have InfiniBand devices and that nv-monitor produces RDMA output. If you see something like this, then it’s a success: