Multi-GPU NCCL error 700 'an illegal memory access was encountered' on NVIDIA B200 GPUs
I am encountering NCCL WARN Cuda failure 700 'an illegal memory access was encountered' errors when running both nccl-tests and vLLM in multi-GPU configurations on NVIDIA B200 GPUs. Single-GPU runs work correctly, but multi-GPU operations fail immediately.
NCCL Test
Command (1 GPU):
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
- Works correctly, shows bandwidth results.
Command (2 GPUs):
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
- Fails with:
Test NCCL failure common.cu:536 'unhandled cuda error'
NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
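For reference, the same two-GPU all_reduce path can also be exercised from Python with a minimal torch.distributed script (a sketch, assuming PyTorch with NCCL support is installed in the same environment; 127.0.0.1:29500 is an arbitrary rendezvous address/port):

```python
# Minimal two-GPU NCCL all_reduce reproducer (a sketch, assuming PyTorch with
# NCCL support is installed; 127.0.0.1:29500 is an arbitrary rendezvous port).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.ones(1024, device=f"cuda:{rank}")
    dist.all_reduce(x)  # same ncclAllReduce path shown in the vLLM traceback below
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, element value = {x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

This goes through the same ncclAllReduce call that vLLM's PyNcclCommunicator issues during startup in the traceback below.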
vLLM Multi-GPU Serving
Command:
vllm serve --model deepseek-ai/DeepSeek-Coder-V2-Lite-Base --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization 0.8
Fails during worker initialization with the following errors:
ERROR 02-10 03:17:41 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 02-10 03:17:41 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 02-10 03:17:41 [multiproc_executor.py:772] worker = WorkerProc(*args, **kwargs)
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.worker.init_device()
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.worker.init_device() # type: ignore
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 218, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772] init_worker_distributed_environment(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 956, in init_worker_distributed_environment
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ensure_model_parallel_initialized(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1450, in ensure_model_parallel_initialized
ERROR 02-10 03:17:41 [multiproc_executor.py:772] initialize_model_parallel(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1347, in initialize_model_parallel
ERROR 02-10 03:17:41 [multiproc_executor.py:772] _TP = init_model_parallel_group(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1067, in init_model_parallel_group
ERROR 02-10 03:17:41 [multiproc_executor.py:772] return GroupCoordinator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 362, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.device_communicator = device_comm_cls(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 58, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.pynccl_comm = PyNcclCommunicator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 146, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.all_reduce(data)
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 172, in all_reduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.nccl.ncclAllReduce(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 429, in ncclAllReduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772] self.NCCL_CHECK(
ERROR 02-10 03:17:41 [multiproc_executor.py:772] File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 373, in NCCL_CHECK
ERROR 02-10 03:17:41 [multiproc_executor.py:772] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 02-10 03:17:41 [multiproc_executor.py:772] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Corresponding NCCL debug log:
all rings, use ring PXN 0 GDR 1
[2026-02-10 03:17:41] tdx-guest:13723:13723 [0] enqueue.cc:1626 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
tdx-guest:13723:13723 [0] NCCL INFO group.cc:299 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:563 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:694 -> 1
tdx-guest:13723:13723 [0] NCCL INFO enqueue.cc:2432 -> 1
The multi-GPU NCCL initialization failure is the same as the one observed with all_reduce_perf.
Environment:
- GPUs: NVIDIA B200
- OS / VM: Running inside a TDX guest VM set up following the NVIDIA TDX Deployment Guide (https://docs.nvidia.com/cc-deployment-guide-tdx.pdf)
- Host OS: Ubuntu 25.10
- VM Guest OS: Ubuntu 24.04
- GPU driver: NVIDIA 590
- CUDA Toolkit: 13.1
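The exact PyTorch / CUDA / NCCL versions seen inside the guest can be captured with a short snippet (a sketch; it assumes PyTorch is installed in the same virtualenv that vLLM runs from):

```python
# Capture the exact PyTorch / CUDA / NCCL versions seen inside the guest
# (a sketch; assumes PyTorch is installed in the same environment as vLLM).
import torch

print("torch:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("NCCL bundled with torch:", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {cap}")
```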
Observations
- Single-GPU runs work fine.
- Runs on more than one GPU fail immediately (see the P2P check sketch below).
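To help isolate whether the failure is in NCCL or in raw peer-to-peer access inside the TDX guest, here is a minimal check that bypasses NCCL entirely (a sketch, assuming PyTorch; it only touches GPU 0 and GPU 1):

```python
# Direct CUDA peer-to-peer check between GPU 0 and GPU 1, bypassing NCCL
# (a sketch, assuming PyTorch; meant to isolate raw P2P behaviour inside the
# TDX guest, since `nvidia-smi topo -p2p n` below reports OK for every pair).
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"
print("can_device_access_peer(0, 1):", torch.cuda.can_device_access_peer(0, 1))

src = torch.arange(1 << 20, device="cuda:0", dtype=torch.float32)
dst = src.to("cuda:1")  # device-to-device copy; uses P2P when it is enabled
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
print("copy matches source:", torch.equal(src.cpu(), dst.cpu()))
```

If this direct copy also raises an illegal memory access, the problem likely sits below NCCL, at the CUDA P2P / NVLink layer of the confidential-computing setup.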
NVLink topology
root@tdx-guest:~# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-31 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-31 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-31 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-31 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 0-31 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 0-31 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 0-31 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
root@tdx-guest:~# nvidia-smi topo -p2p n
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown