Nv-monitor: Add RDMA/InfiniBand metrics (data export only) - testers needed

Could anyone with access to the appropriate hardware please test that nv-monitor is exporting RDMA metrics correctly? I don’t have the hardware myself. Binaries are built by the CI, so it should be quick to dump the output to CSV or to your OTel dashboard.

If you need any opt-in UI enhancements, raise them with me and I can try to build them blind. Or let me loose on your infra and I’ll test it myself 😊

Paul

Beautiful project! As Claude mentioned, I’ve added this because there was an issue with commas and Prometheus. Summary of what was documented and fixed:

Fix applied

Root cause: setlocale(LC_ALL, "") in main() was inheriting es_ES.UTF-8 from the system, causing snprintf("%.2f") to produce 1,00 (comma) instead of 1.00 (period), which is an invalid format for Prometheus.

Fix in code (nv-monitor.c:1757):

setlocale(LC_ALL, "");
setlocale(LC_NUMERIC, "C"); /* Force decimal point for Prometheus exposition format */

This forces a decimal point for floats without breaking Unicode/ncursesw.
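For anyone who wants to see the effect in isolation, here is a minimal sketch of the same idea. The helper names init_locale and format_gauge and the metric name demo_metric are made up for illustration and are not taken from nv-monitor.c; only the two setlocale calls come from the fix above.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: set up the locale the way the fix describes. */
void init_locale(void)
{
    setlocale(LC_ALL, "");      /* inherit the user's locale (Unicode, ncursesw) */
    setlocale(LC_NUMERIC, "C"); /* but always format numbers with a '.' */
}

/* Hypothetical helper: write one "name value" line into buf.
 * Returns what snprintf returns (length that was or would be written). */
int format_gauge(char *buf, size_t len, const char *name, double value)
{
    return snprintf(buf, len, "%s %.2f\n", name, value);
}
```

Calling init_locale() once at startup and then format_gauge(buf, sizeof buf, "demo_metric", 1.0) should give "demo_metric 1.00" even when the environment is set to a comma-decimal locale such as es_ES.UTF-8, because the second setlocale call overrides only the numeric category.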

Edit: Hah - it’s a one-line change. v1.5.1 should be building now.

Thank you for the kind words and this is a great catch!

My first PR, how exciting! Please check it out: Fix Prometheus scrape failure on non-English locales by vedcsolution · Pull Request #13 · wentbackward/nv-monitor · GitHub

Thank you! We had already fixed the issue based on your earlier report, but the PR was still valuable because I had not updated the documentation. Thank you for doing that!

- Paul

Happy to do this on my duo cluster. I am new to this, though. Can you share instructions on what to run and how to send you the results?

That would be wonderful - there are binaries built by GitHub’s CI. If you’re OK to download those, it will save having to build it.

These steps assume a Spark cluster (Arm64). On x86_64 systems, use nv-monitor-linux-amd64 instead.

# Check there are InfiniBand devices available - if not, then quit the test
ls /sys/class/infiniband/

# download the binary to ./nv-monitor
curl -L -o nv-monitor https://github.com/wentbackward/nv-monitor/releases/download/v1.6.1/nv-monitor-linux-arm64

# run the monitor for 30s at 2s intervals
timeout 30 ./nv-monitor -n -l rdma-test.csv -i 2000

# check if rdma output made it into the log
grep -n rdma rdma-test.csv

# clean up
rm -f nv-monitor rdma-test.csv

All I need to check is that you have InfiniBand devices and that nv-monitor produces RDMA output. If you see something like this, then it’s a success:

rdma_mlx5_0_p1_xmit_Bps    1234567
rdma_mlx5_0_p1_recv_Bps    7654321
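If you want to sanity-check the numbers, here is a rough sketch of how a bytes-per-second value like the ones above could be derived from two counter samples. This is an assumption for illustration, not nv-monitor's actual code: it assumes the kernel's per-port data counters (port_xmit_data and port_rcv_data under /sys/class/infiniband/) count 4-byte words, and the helper names ib_rate_bps and ib_metric_name are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: the port data counters count 4-byte words,
 * so a counter delta is multiplied by 4 to get bytes. */
#define IB_DATA_WORD_BYTES 4u

/* Hypothetical helper: bytes per second from two counter samples
 * taken interval_ms milliseconds apart. */
uint64_t ib_rate_bps(uint64_t prev, uint64_t cur, uint32_t interval_ms)
{
    if (interval_ms == 0 || cur < prev)
        return 0; /* bad interval or counter reset: report nothing */
    return (cur - prev) * IB_DATA_WORD_BYTES * 1000u / interval_ms;
}

/* Hypothetical helper: build a name shaped like rdma_mlx5_0_p1_xmit_Bps. */
int ib_metric_name(char *buf, size_t len, const char *dev, int port,
                   const char *dir)
{
    return snprintf(buf, len, "rdma_%s_p%d_%s_Bps", dev, port, dir);
}
```

With the 2000 ms interval used in the test command, a counter that advances by 500 words between samples would work out to 1000 bytes per second under these assumptions.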

Thank you so much for offering to do this. I hope the instructions make it as easy as possible!

- Paul

Followed instructions. InfiniBand traffic is being logged. It works!

Thank you so much for the confirmation! It was built to spec without the hardware being available to me. I’m glad it works!

🙏🏼