Problem running an OpenMPI/UCX application using SR-IOV and containers on the same host

I am trying to run an OpenMPI application within containers on the same host. The server contains a ConnectX-6 DX dual-port 100 GbE card. A unique VF adapter is attached to each container, and each is given its own IP address.

I can run OpenMPI applications when the nodes are spawned on different servers (server1(container1) ↔ server2(container2)) and see near line-rate performance.

However, when the two containers are located on the same physical server, the OpenMPI application runs very slowly and I see no traffic on the switch.

Attached is a PDF with additional details.
openmpi_container_issue.pdf (556.7 KB)

Why does the single-host test below fail?

OpenMPI's IMB-MPI1 PingPong fails to complete when running on one host (BW ≈ 0 MB/sec).
Note: No traffic is seen on the switch during the test.
[tester@1211ca0743ed ~]$ /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun -np 2 --mca oob_tcp_if_exclude eth0,lo --mca btl_tcp_if_exclude eth0,lo --mca pml ucx --mca osc ucx -H, /usr/mpi/gcc/openmpi-4.1.7a1/tests/imb/IMB-MPI1 pingpong

The test just times out; you can increase the time limit.

524288 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.
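For reference, the `-time` option suggested by that message is appended to the benchmark's own arguments. A sketch based on the mpirun command above (host list omitted; adjust paths and hosts to your setup):

```shell
# Raise the IMB per-sample time limit to 60 seconds, as the error
# message suggests ("use -time X ... to increase time limit").
/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun -np 2 \
    --mca oob_tcp_if_exclude eth0,lo \
    --mca btl_tcp_if_exclude eth0,lo \
    --mca pml ucx --mca osc ucx \
    /usr/mpi/gcc/openmpi-4.1.7a1/tests/imb/IMB-MPI1 pingpong -time 60
```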

I’m trying to port an existing MPI application that previously ran one executable per host, containerizing the executables and running them on newer hardware where I should be able to run 2-4 containers per host. We are trying to give each container its own VF network device so that the containers can communicate over the RoCE fabric just as they did previously when running as native host applications.
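For context, this is roughly how the VFs are created and handed to the containers. A sketch only: the interface name `enp1s0f0`, the VF netdev name, and the container name are placeholders for my actual setup:

```shell
# Create 2 VFs on the physical function via the standard SR-IOV sysfs
# interface (interface name is a placeholder).
echo 2 > /sys/class/net/enp1s0f0/device/sriov_numvfs

# Each VF appears as its own netdev/RDMA device; one is moved into each
# container's network namespace and given its own IP address.
# (VF netdev name and container name are placeholders.)
ip link set enp1s0f0v0 netns "$(docker inspect -f '{{.State.Pid}}' container1)"
```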

As far as increasing the timeout goes, I could do that, but when I communicate Host1-to-Host2 between my containers I’m able to get 11991 MB/sec with the OpenMPI test program IMB-MPI1 PingPong (#2 in the PDF).

We have dual-port ConnectX-6 cards in our servers, and I would like to be able to give each container its own ConnectX-6 port and achieve performance similar to what we have in the existing application today. Unfortunately, when I do that, I’m only seeing 9 MB/sec (#3 in the PDF).

I did some investigation over the weekend, using tcpdump/Wireshark to look at the traffic, and I can see RoCE traffic on each of the ConnectX-6 adapter ports in #3 (PDF). So my earlier statement about not seeing traffic was incorrect: the traffic IS present, but occurring at a much lower rate.
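For anyone reproducing this, a sketch of how such a capture can be done: RoCE v2 is encapsulated in UDP on destination port 4791, so it can be filtered with a standard tcpdump expression (interface name is a placeholder; capturing on the PF may not see all VF traffic, depending on the NIC's switching mode):

```shell
# Capture RoCE v2 traffic (UDP port 4791) on the adapter port and save
# it for inspection in Wireshark.
tcpdump -i enp1s0f0 -w roce.pcap 'udp port 4791'
```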

In #1 (from the PDF) I used the ucx_perftest application to send traffic between two containers running on the same host. Each container instance had its own VF on a unique port of the dual-port ConnectX-6 card. In that case I was able to achieve 11631 MB/sec. It seems to me that what I’m trying to do should be possible, but when I attempt to use the full stack of OpenMPI/UCX/Mellanox drivers from MOFED, things don’t work as they do with the ucx_perftest tool.
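For reference, the ucx_perftest run in #1 was along these lines. A sketch only: the RDMA device names, server IP, and message size are placeholders for my actual setup:

```shell
# In container1 (server side), pinned to its VF's RDMA device:
UCX_NET_DEVICES=mlx5_2:1 ucx_perftest -t tag_bw -s 1048576

# In container2 (client side), connecting to container1's IP:
UCX_NET_DEVICES=mlx5_3:1 ucx_perftest 192.168.1.11 -t tag_bw -s 1048576
```

Pinning each side to its own device with UCX_NET_DEVICES forces the traffic over the intended VF rather than whatever transport UCX would otherwise select.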