NVIDIA B200: NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

Multi-GPU NCCL 'Cuda failure 700: an illegal memory access was encountered' on NVIDIA B200 GPUs

I am encountering NCCL WARN Cuda failure 700 'an illegal memory access was encountered' errors when running both nccl-tests and vLLM with multi-GPU setups on NVIDIA B200 GPUs. Single-GPU runs work correctly, but multi-GPU operations fail immediately.

NCCL Test

Command (1 GPU):

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
  • Works correctly, shows bandwidth results.

Command (2 GPUs):

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
  • Fails with:
Test NCCL failure common.cu:536 'unhandled cuda error'
NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

vLLM Multi-GPU Serving

Command:

vllm serve --model deepseek-ai/DeepSeek-Coder-V2-Lite-Base --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization 0.8

Error log:

ERROR 02-10 03:17:41 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 02-10 03:17:41 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     worker = WorkerProc(*args, **kwargs)
ERROR 02-10 03:17:41 [multiproc_executor.py:772]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.worker.init_device()
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.worker.init_device()  # type: ignore
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 218, in init_device
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     init_worker_distributed_environment(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 956, in init_worker_distributed_environment
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     ensure_model_parallel_initialized(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1450, in ensure_model_parallel_initialized
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     initialize_model_parallel(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1347, in initialize_model_parallel
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     _TP = init_model_parallel_group(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1067, in init_model_parallel_group
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     return GroupCoordinator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]            ^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 362, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.device_communicator = device_comm_cls(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]                                ^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 58, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.pynccl_comm = PyNcclCommunicator(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]                        ^^^^^^^^^^^^^^^^^^^
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 146, in __init__
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.all_reduce(data)
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 172, in all_reduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.nccl.ncclAllReduce(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 429, in ncclAllReduce
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     self.NCCL_CHECK(
ERROR 02-10 03:17:41 [multiproc_executor.py:772]   File "/root/.env/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 373, in NCCL_CHECK
ERROR 02-10 03:17:41 [multiproc_executor.py:772]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 02-10 03:17:41 [multiproc_executor.py:772] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
all rings, use ring PXN 0 GDR 1

[2026-02-10 03:17:41] tdx-guest:13723:13723 [0] enqueue.cc:1626 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
tdx-guest:13723:13723 [0] NCCL INFO group.cc:299 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:563 -> 1
tdx-guest:13723:13723 [0] NCCL INFO group.cc:694 -> 1
tdx-guest:13723:13723 [0] NCCL INFO enqueue.cc:2432 -> 1

The multi-GPU NCCL initialization failure is the same as the one seen with all_reduce_perf.

Environment:

Observations

  • Single-GPU runs work fine.
  • Any run using more than one GPU fails immediately.

NVLink topology

root@tdx-guest:~# nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-31	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	0-31	0		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	0-31	0		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	0-31	0		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

root@tdx-guest:~# nvidia-smi topo -p2p n

 	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7
 GPU0	X	OK	OK	OK	OK	OK	OK	OK
 GPU1	OK	X	OK	OK	OK	OK	OK	OK
 GPU2	OK	OK	X	OK	OK	OK	OK	OK
 GPU3	OK	OK	OK	X	OK	OK	OK	OK
 GPU4	OK	OK	OK	OK	X	OK	OK	OK
 GPU5	OK	OK	OK	OK	OK	X	OK	OK
 GPU6	OK	OK	OK	OK	OK	OK	X	OK
 GPU7	OK	OK	OK	OK	OK	OK	OK	X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

You are using pPCIe (protected PCIe) mode, aren't you?
Could you try a simple application that performs a peer-to-peer cudaMemcpyAsync between two GPUs?
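
A minimal sketch of such a check, in case it helps. Assumptions: the two GPUs under test are devices 0 and 1, the 64 MiB transfer size and the file name p2p_memcpy_test.cu are illustrative, and it is built with plain nvcc (nvcc -o p2p_memcpy_test p2p_memcpy_test.cu).

// p2p_memcpy_test.cu -- minimal peer-to-peer cudaMemcpyAsync check between GPU 0 and GPU 1.
// Device indices and transfer size are illustrative, not prescriptive.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(err_));                             \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main() {
    const int dev0 = 0, dev1 = 1;       // GPUs under test (illustrative)
    const size_t bytes = 64 << 20;      // 64 MiB transfer (illustrative)

    // Ask the driver whether peer access is reported as possible in both directions.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, dev0, dev1));
    CHECK(cudaDeviceCanAccessPeer(&can10, dev1, dev0));
    printf("cudaDeviceCanAccessPeer: 0->1=%d, 1->0=%d\n", can01, can10);

    // Allocate one buffer on each GPU.
    void *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaMalloc(&buf1, bytes));

    // Try to enable peer access in both directions; report but do not abort
    // if it is already enabled or unsupported.
    CHECK(cudaSetDevice(dev0));
    cudaError_t e = cudaDeviceEnablePeerAccess(dev1, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled)
        printf("enable peer access 0->1: %s\n", cudaGetErrorString(e));
    CHECK(cudaSetDevice(dev1));
    e = cudaDeviceEnablePeerAccess(dev0, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled)
        printf("enable peer access 1->0: %s\n", cudaGetErrorString(e));

    // Peer-to-peer async copy GPU0 -> GPU1 (cudaMemcpyDefault lets UVA infer
    // the direction from the pointers), then synchronize.
    CHECK(cudaSetDevice(dev0));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMemcpyAsync(buf1, buf0, bytes, cudaMemcpyDefault, stream));
    CHECK(cudaStreamSynchronize(stream));
    printf("peer-to-peer cudaMemcpyAsync of %zu bytes succeeded\n", bytes);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buf0));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaFree(buf1));
    return 0;
}

If this small copy already triggers the illegal memory access, the problem is likely in the peer-to-peer path of the TDX/pPCIe guest rather than in NCCL or vLLM themselves.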