Memory increase in GPU-aware non-blocking MPI communications

Hello Timothee,

I have started using export UCX_TLS=^cuda_ipc when more than one node is used, as long as the time spent in intra-node communications is not a bottleneck compared to the time spent in inter-node communications. However, this is probably not the best thing to do; I hope NVIDIA support can provide a better reply :)
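
To give you an idea, this is roughly how I do it in a batch script, just a sketch assuming a Slurm cluster (the resource options and the application name my_gpu_mpi_app are placeholders, adapt them to your setup):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=4

    # Disable the UCX CUDA IPC transport only on multi-node runs,
    # where inter-node traffic dominates and losing the fast
    # intra-node GPU-to-GPU path matters less.
    if [ "$SLURM_JOB_NUM_NODES" -gt 1 ]; then
        export UCX_TLS=^cuda_ipc
    fi

    srun ./my_gpu_mpi_app

For single-node runs I leave UCX_TLS alone, so cuda_ipc is still used where it gives the best intra-node performance.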

By the way, you can also try building and linking against hpcx; I have seen better performance with it, without needing all these exports.
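
If it helps, this is roughly how I set it up (from memory, so please double-check against the hpcx documentation; the install path and application name below are placeholders):

    # placeholder path, point it to wherever hpcx is unpacked on your system
    export HPCX_HOME=/path/to/hpcx
    source "$HPCX_HOME/hpcx-init.sh"
    hpcx_load
    # rebuild the application with the hpcx-provided Open MPI wrappers
    mpicc -o my_gpu_mpi_app my_gpu_mpi_app.c
    mpirun -np 2 ./my_gpu_mpi_app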

Let me know if this helps!

See you,

Laura