RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices

Hi there.

We're encountering the following error when running this command (using run.py from https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py):

```
mpirun -n 8 --allow-run-as-root python3 …/run.py --max_output_len=50 --engine_dir ./phi-2-engine-v4/ --input_text "input text"
```

RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)

We're using 8 x V100s.

When running with 1, 2, or 4 GPUs, everything runs fine.
After analyzing the architecture, we understand that this happens because of how the GPUs are grouped on NVLink: only 4 of the GPUs are directly connected to each other through NVLink, which explains why we can run with no more than 4. I've attached a diagram of the topology I'm referring to, and a quick peer-access check (below) that confirms it.
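
For reference, here is the minimal sketch we used to see which GPU pairs lack peer access, using PyTorch (already installed alongside TensorRT-LLM in our environment); assuming the failure comes from the peer-access check in ipcUtils.cpp, as the trace suggests, any pair printed here would hit that error. `nvidia-smi topo -m` shows the same topology from the driver's side.

```python
import torch

# Print every ordered GPU pair that does NOT support CUDA peer (P2P) access.
# On our box, pairs outside the 4-GPU NVLink group show up here, matching
# the "peer access is not supported" error from ipcUtils.cpp.
n = torch.cuda.device_count()
print(f"Visible GPUs: {n}")
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: no peer access")
```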

Is there a way to make it work with all 8 GPUs? Or is it a limitation of the hardware?

#nvidiainception