Thank you for reading this topic.
I am running the DOCA UROM sample codes urom_multi_worker_bootstrap and worker_graph.
While doing so, I see the following log lines:
[1760611258.116470] [localhost:15 :0] init.c:122 UCX DEBUG cmd line: /opt/mellanox/doca/services/urom/bin/doca_urom_worker -wp 10003 -sp 10002 -i 0 -n 1 -m 4096 -p /opt/mellanox/doca/samples/doca_urom/plugins/worker_graph/worker_graph.so:4 -l 50 --sdk-log-level 50
and
[1760611258.627016] [localhost:15 :0] tcp_listener.c:134 UCX DEBUG created a TCP listener 0xaaab0b657fe0 on cm 0xaaab0bc669e0 with fd: 95 listening on 0.0.0.0:10003
I expected the host to reach the UROM worker listening on port 10003 at 172.16.0.6 (the IP address assigned to the NIC). Instead, the host side connects to a loopback address:
[1760611258.809148] [1gpu:492031:1] sock.c:358 UCX DEBUG connect(fd=78, src_addr=127.0.0.1:45506 dest_addr=127.0.0.1:10003): Operation now in progress
[1760611258.809164] [1gpu:492031:1] async.c:247 UCX DEBUG added async handler 0x7f1b5027dbb0 [id=78 ref 1] uct_tcp_sa_data_handler() to hash
[1760611258.809170] [1gpu:492031:1] async.c:521 UCX DEBUG listening to async event fd 78 events 0x2 mode thread_spinlock
[1760611258.809172] [1gpu:492031:1] tcp_sockcm_ep.c:923 UCX DEBUG created a TCP SOCKCM endpoint (fd=78) on tcp cm 0x7f1b501526a0, remote addr: 127.0.0.1:10003
It seems that the address is being passed as a loopback address (127.0.0.1) instead of the expected NIC address.
As a result, the host and the UROM worker cannot communicate, and the program fails with an error.
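To make it easier to find every such connection attempt in the full client log, here is a minimal Python sketch that scans UCX debug output for connect() lines whose destination is a loopback address. The function name find_loopback_connects and the regular expression are my own; they are based only on the sock.c line format quoted above, so other UCX versions may format the line differently.

```python
import ipaddress
import re

# Pattern modeled on the sock.c debug line quoted in this post, e.g.
#   connect(fd=78, src_addr=127.0.0.1:45506 dest_addr=127.0.0.1:10003)
# Assumption: IPv4 addresses only, as in the quoted logs.
CONNECT_RE = re.compile(
    r"connect\(fd=(?P<fd>\d+), src_addr=(?P<src>[\d.]+):(?P<sport>\d+) "
    r"dest_addr=(?P<dst>[\d.]+):(?P<dport>\d+)\)"
)


def find_loopback_connects(lines):
    """Return (fd, dest_ip, dest_port) for each connect() aimed at loopback."""
    hits = []
    for line in lines:
        match = CONNECT_RE.search(line)
        if match and ipaddress.ip_address(match.group("dst")).is_loopback:
            hits.append(
                (int(match.group("fd")), match.group("dst"), int(match.group("dport")))
            )
    return hits


# The connect() line from the client log above, used as sample input.
sample = [
    "[1760611258.809148] [1gpu:492031:1] sock.c:358 UCX DEBUG "
    "connect(fd=78, src_addr=127.0.0.1:45506 dest_addr=127.0.0.1:10003): "
    "Operation now in progress",
]
print(find_loopback_connects(sample))
```

To run it on the attached log, the sample list can be replaced with the lines of client_tcp11.log, for example list(open("client_tcp11.log")).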
How can this issue be fixed? Or is it a bug?
Here is my current environment:
Environment variables:
Client side:
[1760611257.858539] [1gpu:492031:0] parser.c:2368 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_0:1,enp179s0f0np0 UCX_HANDLE_ERRORS=bt UCX_IB_GID_INDEX=3 UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace
Server side:
[1760611140.607364] [localhost:7 :0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=enp3s0f0s0,mlx5_2:1,lo UCX_IB_GID_INDEX=1 UCX_MODULE_DIR=/usr/lib/ucx UCX_TLS=all UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace
Versions:
Host: UCX 1.19.0
DPU: UCX 1.17.0
DOCA version 3.1.0 is used, except for the UROM daemon image, which is version 2.7.0.
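To compare the client and server UCX settings listed under "Environment variables", here is a small Python sketch that parses the two "UCX_* env variables" log lines and prints the variables that differ. The helper names parse_ucx_env and diff_env are hypothetical, and the parsing assumes space-separated NAME=VALUE pairs as in the lines quoted above. It only reports differences; whether any of them (for example, only the server list including lo in UCX_NET_DEVICES) is related to the loopback address is not established.

```python
import re


def parse_ucx_env(log_line):
    """Extract UCX_* NAME=VALUE pairs from a 'UCX_* env variables:' log line."""
    _, _, tail = log_line.partition("env variables:")
    return dict(re.findall(r"(UCX_[A-Z0-9_]+)=(\S+)", tail))


def diff_env(client, server):
    """Map each differing variable to a (client_value, server_value) pair."""
    keys = sorted(set(client) | set(server))
    return {
        key: (client.get(key), server.get(key))
        for key in keys
        if client.get(key) != server.get(key)
    }


# The two env lines quoted in this post.
client_line = (
    "[1760611257.858539] [1gpu:492031:0] parser.c:2368 UCX INFO UCX_* env variables: "
    "UCX_NET_DEVICES=mlx5_0:1,enp179s0f0np0 UCX_HANDLE_ERRORS=bt UCX_IB_GID_INDEX=3 "
    "UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace"
)
server_line = (
    "[1760611140.607364] [localhost:7 :0] parser.c:2314 UCX INFO UCX_* env variables: "
    "UCX_NET_DEVICES=enp3s0f0s0,mlx5_2:1,lo UCX_IB_GID_INDEX=1 UCX_MODULE_DIR=/usr/lib/ucx "
    "UCX_TLS=all UCX_SOCKADDR_TLS_PRIORITY=tcp,rdmacm UCX_LOG_LEVEL=trace"
)

differences = diff_env(parse_ucx_env(client_line), parse_ucx_env(server_line))
for name, (client_value, server_value) in differences.items():
    print(f"{name}: client={client_value} server={server_value}")
```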
I have attached the full logs from running the sample codes urom_multi_worker_bootstrap and worker_graph. Because I re-recorded the logs, their timestamps do not match the excerpts above, but they show the same behavior. To rule out multithreading as a factor, I modified the code so that it runs with only one thread. I also downgraded the version and ran the code again, but the same error occurred, so I believe the version is not the cause.
server_tcp11.log (1.4 MB)
client_tcp11.log (360.2 KB)
Since I wasn’t sure how to tag the topic, I’ve made the same post at this link as well.
I would greatly appreciate any hints or guidance that could help resolve this issue.