Run Tritonserver in Nvidia containers with OpenMPI

Hi, I am trying to use OpenMPI to run containers, following the instructions here.

I have successfully built OpenMPI (3.1.4) with UCX (1.10) and CUDA (11.6) support. I am using the following command to launch the job on one node with 2 NVIDIA A100 GPUs:

mpirun -np 2 -npernode 2 --hostfile hosts --mca pml ucx singularity run --nv -B /model_repository:/models tritonserver-22.02-py3.sif tritonserver --model-repository=/models

I expect it to map GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1, as common CUDA-aware MPI setups do. But checking nvidia-smi, it appears that the application in each MPI process uses all of the GPUs, which is not ideal.
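One common way to get that per-rank GPU mapping is a small wrapper script that sets CUDA_VISIBLE_DEVICES from OpenMPI's local-rank environment variable before launching the container. This is a sketch, not something from the original post; the script name bind_gpu.sh is hypothetical, and it assumes OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK variable is available in the launched process:

```shell
#!/bin/bash
# bind_gpu.sh -- hypothetical wrapper: pin each local MPI rank to one GPU.
# OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK to every process it launches;
# default to 0 if it is not set (e.g. when run outside mpirun).
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
echo "local rank ${OMPI_COMM_WORLD_LOCAL_RANK:-0} -> GPU ${CUDA_VISIBLE_DEVICES}"
# Replace this shell with the wrapped command (singularity, tritonserver, ...).
exec "$@"
```

It would then be inserted between mpirun and the container command, e.g. `mpirun -np 2 -npernode 2 ... ./bind_gpu.sh singularity run --nv ...`. Since Singularity passes host environment variables through by default, each container should then see only its assigned GPU as device 0.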

[Screenshot: nvidia-smi output]

I have also tried to call MPI inside the container, as follows:

singularity run --nv -B /model_repository:/models mpirun -np 2 -npernode 2 --hostfile hosts --mca pml ucx tritonserver-22.02-py3.sif tritonserver --model-repository=/models

But the result is the same. I am wondering what could be going wrong here.

How is the application setting the rank to device binding?

Typically this is done in the program itself, using cudaSetDevice in a CUDA program or acc_set_device with OpenACC. (What programming language is the application using?)

If the device is set after calling MPI_Init, OpenMPI will create a CUDA context on the default device, so nvidia-smi will show an extra process per rank. That is likely what's happening here.

It is a bit odd that the contexts appear to be created on device 1 while both ranks then set themselves to use device 0, but that is possible if the default device is 1.

Hi @MatColgrove , thanks for the reply.

I checked the application's repository (GitHub - triton-inference-server/fastertransformer_backend); it seems this is more of a question about the Triton server application itself.

Currently they don't support multiple processes per node. I'm wondering, then, how intra-node GPUs can communicate via MPI, since they are all under one MPI process, i.e. the same rank.

You're correct that it's unlikely they'd be able to use CUDA-aware MPI. It's possible they use GPUDirect calls or NVSHMEM, but more likely they don't do direct communication. I don't know the application, though, so it's probably best to ask the developers.



Hi @MatColgrove, thanks for the hint. I will go check GPUDirect and NVSHMEM.