Runtime Error When Executing DOCA UROM Sample Code

Thank you for reading this topic.
I was running the DOCA UROM sample codes urom_multi_worker_bootstrap and worker_graph.

Then I encountered the following logs:

[1760611258.116470] [localhost:15 :0] init.c:122 UCX DEBUG cmd line: /opt/mellanox/doca/services/urom/bin/doca_urom_worker -wp 10003 -sp 10002 -i 0 -n 1 -m 4096 -p /opt/mellanox/doca/samples/doca_urom/plugins/worker_graph/worker_graph.so:4 -l 50 --sdk-log-level 50

and

[1760611258.627016] [localhost:15 :0] tcp_listener.c:134 UCX DEBUG created a TCP listener 0xaaab0b657fe0 on cm 0xaaab0bc669e0 with fd: 95 listening on 0.0.0.0:10003

I expected the connection to the UROM worker listening on port 10003 to use the IP address 172.16.0.6 (assigned to the NIC), but on the host side the destination appears as a loopback address:

[1760611258.809148] [1gpu:492031:1] sock.c:358 UCX DEBUG connect(fd=78, src_addr=127.0.0.1:45506 dest_addr=127.0.0.1:10003): Operation now in progress
[1760611258.809164] [1gpu:492031:1] async.c:247 UCX DEBUG added async handler 0x7f1b5027dbb0 [id=78 ref 1] uct_tcp_sa_data_handler() to hash
[1760611258.809170] [1gpu:492031:1] async.c:521 UCX DEBUG listening to async event fd 78 events 0x2 mode thread_spinlock
[1760611258.809172] [1gpu:492031:1] tcp_sockcm_ep.c:923 UCX DEBUG created a TCP SOCKCM endpoint (fd=78) on tcp cm 0x7f1b501526a0, remote addr: 127.0.0.1:10003
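
To check how often this happens across the full client log, the connect() entries can be filtered by destination address. The following is a minimal Python sketch, assuming the log format matches the connect(...) line quoted above and that addresses are IPv4; the helper name loopback_connects is hypothetical and not part of UCX or DOCA.

```python
import ipaddress
import re

# Pattern modeled on the UCX DEBUG connect() line quoted in this post
# (assumption: other UCX versions may format the line differently).
CONNECT_RE = re.compile(
    r"connect\(fd=(\d+), src_addr=(\S+):(\d+) dest_addr=(\S+):(\d+)\)"
)


def loopback_connects(lines):
    """Return (dest_ip, dest_port) for every connect() aimed at a loopback address."""
    hits = []
    for line in lines:
        match = CONNECT_RE.search(line)
        if match is None:
            continue
        dest_ip, dest_port = match.group(4), int(match.group(5))
        try:
            is_loopback = ipaddress.ip_address(dest_ip).is_loopback
        except ValueError:
            # Not a plain IP literal; skip rather than guess.
            continue
        if is_loopback:
            hits.append((dest_ip, dest_port))
    return hits


sample = (
    "[1760611258.809148] [1gpu:492031:1] sock.c:358 UCX DEBUG "
    "connect(fd=78, src_addr=127.0.0.1:45506 dest_addr=127.0.0.1:10003): "
    "Operation now in progress"
)
print(loopback_connects([sample]))
```

For the attached file, the same function could be called as loopback_connects(open("client_tcp11.log")).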

It seems that the address is being passed as a loopback address (127.0.0.1) instead of the expected NIC address.

As a result, the host and the UROM worker cannot communicate, and the program fails with an error.
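
For context on the two log excerpts: a TCP listener bound to 0.0.0.0 accepts connections on every local interface, so the worker's listen address by itself does not force loopback. What matters is the destination the client connects to, and 127.0.0.1 on the host refers to the host itself rather than the DPU. The minimal Python sketch below uses plain sockets (not UCX or DOCA code) and assumes a working loopback interface; it only illustrates that a wildcard-bound listener is reachable through 127.0.0.1 on the same machine.

```python
import socket


def wildcard_listener_reachable_via_loopback() -> bool:
    """Bind a listener to 0.0.0.0 and connect to it through 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", 0))  # wildcard address, OS-chosen free port
        srv.listen(1)
        port = srv.getsockname()[1]
        # Connect over loopback; this only reaches a listener on the same machine.
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            conn, peer = srv.accept()
            with conn:
                # The accepted socket reports the loopback peer and the listen port.
                return peer[0] == "127.0.0.1" and conn.getsockname()[1] == port


print(wildcard_listener_reachable_via_loopback())
```

The same connect issued from a different machine would never reach this listener through 127.0.0.1, which is consistent with the host failing to reach a worker that runs on the DPU.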

How can this issue be fixed? Or is it a bug?

Here is my current environment:

Environment variables:
Client side:

[1760611257.858539] [1gpu:492031:0] parser.c:2368 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1,enp179s0f0np0 UCX_HANDLE_ERRORS=bt UCX_IB_GID_INDEX=3 UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace

Server side:

[1760611140.607364] [localhost:7 :0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=enp3s0f0s0,mlx5_2:1,lo UCX_IB_GID_INDEX=1 UCX_MODULE_DIR=/usr/lib/ucx UCX_TLS=all UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace
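
To compare the two configurations side by side, the UCX_* summary lines can be split into key/value pairs. This is a small Python sketch, assuming the format shown in the two log lines above (space-separated KEY=VALUE tokens after "UCX_* env variables:"); the helper name parse_ucx_env is hypothetical.

```python
def parse_ucx_env(log_line: str) -> dict:
    """Extract UCX_* settings from a 'UCX_* env variables:' log line."""
    _, found, tail = log_line.partition("UCX_* env variables:")
    if not found:
        return {}
    env = {}
    for token in tail.split():
        key, sep, value = token.partition("=")
        if sep and key.startswith("UCX_"):
            env[key] = value
    return env


client = parse_ucx_env(
    "[1760611257.858539] [1gpu:492031:0] parser.c:2368 UCX INFO UCX_* env variables: "
    "UCX_NET_DEVICES=mlx5_0:1,enp179s0f0np0 UCX_HANDLE_ERRORS=bt UCX_IB_GID_INDEX=3 "
    "UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace"
)
server = parse_ucx_env(
    "[1760611140.607364] [localhost:7 :0] parser.c:2314 UCX INFO UCX_* env variables: "
    "UCX_NET_DEVICES=enp3s0f0s0,mlx5_2:1,lo UCX_IB_GID_INDEX=1 UCX_MODULE_DIR=/usr/lib/ucx "
    "UCX_TLS=all UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace"
)

# Settings that differ between the two sides, or exist on only one side (None).
diff = {
    key: (client.get(key), server.get(key))
    for key in sorted(set(client) | set(server))
    if client.get(key) != server.get(key)
}
for key, (client_value, server_value) in diff.items():
    print(f"{key}: client={client_value} server={server_value}")
```

Among the differences, the server-side UCX_NET_DEVICES list includes lo while the client-side list does not. Whether that is related to the loopback destination is not established here.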

Versions:
Host: UCX 1.19.0
DPU: UCX 1.17.0

DOCA version 3.1.0 is used everywhere except for the UROM daemon image, which is version 2.7.0.

I have attached the full logs from running the sample codes urom_multi_worker_bootstrap and worker_graph. Since I re-recorded the logs, the absolute timestamps differ from those quoted above, but the same behavior can be confirmed in them. To rule out multithreading as a factor, I modified the code so that it runs with only one thread. I also downgraded the version and ran the code again, and the same error occurred, so I believe the version is not the cause.

server_tcp11.log (1.4 MB)

client_tcp11.log (360.2 KB)

Since I wasn’t sure how to tag the topic, I’ve made the same post at this link as well.

I would greatly appreciate any hints or guidance that could help resolve this issue.

Hi yosei0107,

Thanks for posting your inquiry to the NVIDIA developer forums.

For questions, comments, and feedback regarding the DOCA reference applications, we recommend customers contact the mailing list at DOCA-Feedback@exchange.nvidia.com.

Best regards,
NVIDIA Enterprise Experience
