Hello Timothee,
I have started using export UCX_TLS=^cuda_ipc whenever a job runs on more than one node, as long as the time spent in intra-node communication is not a bottleneck compared with the time spent in inter-node communication. However, this is probably not the best thing to do; I hope NVIDIA support can provide a better reply :)
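In case it is useful, here is a minimal sketch of how I would put that in a job script. It assumes the scheduler is Slurm, so the node count comes from SLURM_JOB_NUM_NODES; the function name set_ucx_tls is only a hypothetical helper, so adapt both to your setup.

```shell
#!/bin/sh
# Hypothetical helper: exclude the cuda_ipc transport only for multi-node runs.
# Assumption: the node count is passed as the first argument (defaults to 1).
set_ucx_tls() {
    nnodes="${1:-1}"
    if [ "$nnodes" -gt 1 ]; then
        # The leading ^ tells UCX to exclude the listed transport.
        export UCX_TLS='^cuda_ipc'
    fi
    # Single-node runs leave UCX_TLS untouched.
}

# Assumption: Slurm sets SLURM_JOB_NUM_NODES; fall back to 1 otherwise.
set_ucx_tls "${SLURM_JOB_NUM_NODES:-1}"
```

Single-node jobs keep the default transports this way, so intra-node GPU communication is not affected there.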
Btw, you can also try linking against HPC-X; I have seen better performance without any of these exports.
Let me know if this helps.
See you,
Laura