NVIDIA B200: NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

Multi-GPU NCCL 'Cuda failure 700: an illegal memory access was encountered' on NVIDIA B200 GPUs

I am encountering NCCL WARN Cuda failure 700 'an illegal memory access was encountered' errors when running both nccl-tests and vLLM with multi-GPU setups on NVIDIA B200 GPUs. Single-GPU runs work correctly, but multi-GPU operations fail immediately.

NCCL Test

Command (1 GPU):

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
  • Works correctly, shows bandwidth results.

Command (2 GPUs):

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
  • Fails with:
Test NCCL failure common.cu:536 'unhandled cuda error'
NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

vLLM Multi-GPU Serving

Command:

vllm serve --model deepseek-ai/DeepSeek-Coder-V2-Lite-Base --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization 0.8

Error log:

ERROR 02-10 03:17:41 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 02-10 03:17:41 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     worker = WorkerProc(*args, **kwargs)
ERROR 02-10 03:17:41 [multiproc_executor.py:772]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.worker.init_device()
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.worker.init_device()  # type: ignore
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 218, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     init_worker_distributed_environment(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 956, in init_worker_distributed_environment
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     ensure_model_parallel_initialized(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1450, in ensure_model_parallel_initialized
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     initialize_model_parallel(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1347, in initialize_model_parallel
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     _TP = init_model_parallel_group(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1067, in init_model_parallel_group
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     return GroupCoordinator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]            ^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 362, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.device_communicator = device_comm_cls(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]                                ^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 58, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.pynccl_comm = PyNcclCommunicator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]                        ^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 146, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.all_reduce(data)
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 172, in all_reduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.nccl.ncclAllReduce(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 429, in ncclAllReduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.NCCL_CHECK(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 373, in NCCL_CHECK
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 02-10 03:17:41 [multiproc_executor.py:772] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
all rings, use ring PXN 0 GDR 1

[2026-02-10 03:17:41] tdx-guest:13723:13723 [0] enqueue.cc:1626 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
tdx-guest:13723:13723 [0] NCCL INFO group.cc:299 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:563 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:694 -> 1
tdx-guest:13723:13723 [0] NCCL INFO enqueue.cc:2432 -> 1

The multi-GPU NCCL initialization failure is the same as the one seen with all_reduce_perf.

Environment:

Observations

  • Single-GPU runs work fine.
  • Any run using more than one GPU fails immediately.

NVLink topology

root@tdx-guest:~# nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	0-31	0		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	0-31	0		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	0-31	0		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

root@tdx-guest:~# nvidia-smi topo -p2p n

 	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7
 GPU0	X	OK	OK	OK	OK	OK	OK	OK
 GPU1	OK	X	OK	OK	OK	OK	OK	OK
 GPU2	OK	OK	X	OK	OK	OK	OK	OK
 GPU3	OK	OK	OK	X	OK	OK	OK	OK
 GPU4	OK	OK	OK	OK	X	OK	OK	OK
 GPU5	OK	OK	OK	OK	OK	X	OK	OK
 GPU6	OK	OK	OK	OK	OK	OK	X	OK
 GPU7	OK	OK	OK	OK	OK	OK	OK	X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

You are using pPCIe (protected PCIe) mode, aren't you?
Could you try a simple application that performs a peer-to-peer cudaMemcpyAsync between two GPUs?
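
A minimal sketch of such a check, in case it helps. Assumptions: the two GPUs under test are devices 0 and 1, the 64 MiB transfer size and the file name p2p_memcpy_test.cu are illustrative, and it is built with plain nvcc (nvcc -o p2p_memcpy_test p2p_memcpy_test.cu).

// p2p_memcpy_test.cu -- minimal peer-to-peer cudaMemcpyAsync check between GPU 0 and GPU 1.
// Device indices and transfer size are illustrative, not prescriptive.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(err_));                             \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main() {
    const int dev0 = 0, dev1 = 1;       // GPUs under test (illustrative)
    const size_t bytes = 64 << 20;      // 64 MiB transfer (illustrative)

    // Ask the driver whether peer access is reported as possible in both directions.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, dev0, dev1));
    CHECK(cudaDeviceCanAccessPeer(&can10, dev1, dev0));
    printf("cudaDeviceCanAccessPeer: 0->1=%d, 1->0=%d\n", can01, can10);

    // Allocate one buffer on each GPU.
    void *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaMalloc(&buf1, bytes));

    // Try to enable peer access in both directions; report but do not abort
    // if it is already enabled or unsupported.
    CHECK(cudaSetDevice(dev0));
    cudaError_t e = cudaDeviceEnablePeerAccess(dev1, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled)
        printf("enable peer access 0->1: %s\n", cudaGetErrorString(e));
    CHECK(cudaSetDevice(dev1));
    e = cudaDeviceEnablePeerAccess(dev0, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled)
        printf("enable peer access 1->0: %s\n", cudaGetErrorString(e));

    // Peer-to-peer async copy GPU0 -> GPU1 (cudaMemcpyDefault lets UVA infer
    // the direction from the pointers), then synchronize.
    CHECK(cudaSetDevice(dev0));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMemcpyAsync(buf1, buf0, bytes, cudaMemcpyDefault, stream));
    CHECK(cudaStreamSynchronize(stream));
    printf("peer-to-peer cudaMemcpyAsync of %zu bytes succeeded\n", bytes);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buf0));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaFree(buf1));
    return 0;
}

If this small copy already triggers the illegal memory access, the problem is likely in the peer-to-peer path of the TDX/pPCIe guest rather than in NCCL or vLLM themselves.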