NeMo Customizer "Cuda failure 'operation not supported'" Error in vGPU Environment

Hi forum,

Problem Description

I’m experiencing a “Cuda failure ‘operation not supported’” error when attempting to run NeMo Customizer fine-tuning in a vGPU environment. The error occurs during the NCCL initialization phase, specifically when PyTorch Lightning tries to set up distributed training.
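
To help isolate this outside of NeMo Customizer, here is a minimal standalone script I put together (my own throwaway sketch, not anything shipped in the NeMo container) that reproduces just the failing step: a single-rank NCCL process group plus one broadcast, which is roughly what Lightning’s DDP strategy does during setup. The MASTER_ADDR/MASTER_PORT values are arbitrary placeholders.

# nccl_repro.py - minimal single-rank NCCL init + broadcast (debugging sketch, not NeMo code)
import os
import torch
import torch.distributed as dist

# Single-rank rendezvous; the address/port are arbitrary placeholders
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("NCCL_DEBUG", "INFO")

def main():
    # world_size=1 still goes through full NCCL communicator setup
    dist.init_process_group(backend="nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    # The same kind of call that fails in the Lightning traceback below
    obj = ["probe"]
    dist.broadcast_object_list(obj, src=0)

    # Force a collective on a CUDA tensor to trigger NCCL communicator init
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print("single-rank NCCL init and collectives succeeded:", x.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()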

Environment Details

Hardware & Infrastructure:

  • Virtualization: VMware vSphere-managed VMs with NVIDIA vGPU
  • GPU: NVIDIA GRID A100D-80C (vGPU, 80GB VRAM)
  • Driver Version: 570.133.20 (vGPU driver)
  • CUDA Version: 12.8
  • Platform: On-premise Kubernetes cluster
  • Container Runtime: nvidia runtime class
  • NCCL Version: 2.25.1 (confirmed from error logs: “NCCL version 2.25.1+cuda12.8”)

Software Stack:

  • NeMo Microservices: Version 25.4.0 (deployed via nemo-microservices-helm-chart)
  • NeMo Customizer: Running as a containerized service within NeMo Microservices
  • PyTorch Lightning: Included in the NeMo container image
  • Model: meta/llama-3.2-1b-instruct
  • Training Type: Fine-tuning
  • Deployment Method: Helm chart installation on Kubernetes

Current Configuration

NeMo Customizer Configuration

customizer:
  enabled: true
  customizerConfig:
    models:
      meta/llama-3.2-1b-instruct:
        enabled: true
    training:
      pvc:
        storageClass: "longhorn"
        size: 10Gi
    trainingNetworking:
      - name: CUDA_VISIBLE_DEVICES
        value: "0"
      - name: NCCL_P2P_DISABLE
        value: "1"
      - name: NCCL_SHM_DISABLE
        value: "1"
      - name: NCCL_LAUNCH_MODE
        value: "GROUP"
      - name: NCCL_IB_DISABLE
        value: "1"
      - name: NCCL_IBEXT_DISABLE
        value: "1"
      - name: NCCL_SOCKET_IFNAME
        value: "lo"
      - name: NCCL_DEBUG
        value: INFO
      - name: UCX_TLS
        value: tcp
      - name: UCX_NET_DEVICES
        value: eth0
  modelsStorage:
    size: 50Gi
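
As a sanity check on the configuration above, I also run a small script inside the training worker container (again, my own debugging sketch, not part of NeMo) to confirm that the trainingNetworking entries actually show up as environment variables and that CUDA is visible to PyTorch:

# env_check.py - verify NCCL/UCX env vars and basic CUDA visibility (debugging sketch)
import os
import torch

VARS = [
    "CUDA_VISIBLE_DEVICES", "NCCL_P2P_DISABLE", "NCCL_SHM_DISABLE",
    "NCCL_LAUNCH_MODE", "NCCL_IB_DISABLE", "NCCL_IBEXT_DISABLE",
    "NCCL_SOCKET_IFNAME", "NCCL_DEBUG", "UCX_TLS", "UCX_NET_DEVICES",
]

# Print each expected variable (or mark it as unset)
for name in VARS:
    print(f"{name}={os.environ.get(name, '<unset>')}")

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("nccl:", torch.cuda.nccl.version())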

Error Logs

Primary Error

NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'

Detailed NCCL Debug Output

cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Bootstrap: Using eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO cudaDriverVersion 12080
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'

Key Stack Trace Points

The error occurs during PyTorch Lightning’s distributed setup:

torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3402, in broadcast_object_list
    broadcast(object_sizes_tensor, src=global_src, group=group)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast
    work = group.broadcast([tensor], opts)

Complete Pod Logs

For additional debugging information, I’ve attached the complete pod log file:
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0_main.log (39.3 KB)
This contains the full training job execution logs including all NCCL output and error details.

nvidia-smi Output

Thu Jul  3 07:52:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-80C                 On  |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0            N/A  /  N/A  |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Questions

  1. Is NeMo Customizer officially supported in vGPU environments? The error suggests a CUDA operation is not supported, which might be a vGPU limitation.

  2. How can I identify which specific CUDA operation is failing? The error message only shows “operation not supported” but doesn’t specify which CUDA API call is causing the issue.

  3. Can NeMo Customizer be configured to run in single-GPU mode without NCCL? Since this is a single-vGPU setup, distributed training isn’t necessary. (See the plain PyTorch Lightning sketch after this list for what I mean by this.)

  4. What are the recommended environment variables for vGPU compatibility? Are there additional CUDA/NCCL settings needed for virtualized GPU environments?
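
To make question 3 concrete, below is roughly what “single GPU, no NCCL” looks like in plain PyTorch Lightning: with devices=1 and no DDP strategy, Lightning uses its single-device strategy and torch.distributed/NCCL is never initialized. The DemoModule here is a hypothetical stand-in of my own, not the NeMo Customizer training module, and I don’t know whether Customizer exposes an equivalent option; that is essentially what I’m asking.

# single_device.py - plain Lightning single-GPU run with no DDP/NCCL (hypothetical sketch)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl

class DemoModule(pl.LightningModule):
    # Trivial stand-in model, NOT the NeMo Customizer training module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=16)

    # devices=1 with no explicit DDP strategy -> single-device strategy,
    # so torch.distributed (and therefore NCCL) is never initialized
    trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=1, logger=False)
    trainer.fit(DemoModule(), loader)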

Attempted Solutions

I’ve tried a comprehensive NCCL configuration with every vGPU-compatible setting I could find:

Current Environment Variables

env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"
  - name: NCCL_P2P_DISABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "1"
  - name: NCCL_LAUNCH_MODE
    value: "GROUP"
  - name: NCCL_IB_DISABLE
    value: "1"
  - name: NCCL_IBEXT_DISABLE
    value: "1"
  - name: NCCL_SOCKET_IFNAME
    value: "lo"
  - name: NCCL_DEBUG
    value: INFO
  - name: UCX_TLS
    value: tcp
  - name: UCX_NET_DEVICES
    value: eth0

Test Results

This comprehensive NCCL configuration covers:

  • All InfiniBand features disabled (NCCL_IB_DISABLE, NCCL_IBEXT_DISABLE)
  • Shared memory disabled (NCCL_SHM_DISABLE)
  • P2P communication disabled (NCCL_P2P_DISABLE)
  • Forced loopback interface (NCCL_SOCKET_IFNAME=lo)
  • Group launch mode (NCCL_LAUNCH_MODE=GROUP)
  • Single GPU visibility (CUDA_VISIBLE_DEVICES=0)

Despite all of this, the error still persists:

cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'

This suggests the issue is not a network-configuration problem, but rather that some underlying CUDA operation is not supported in the vGPU environment.
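
To check that hypothesis, I plan to run a CUDA-only probe like the one below (another sketch of mine, no NCCL involved). If plain allocations, kernel launches, and pinned-memory copies succeed while NCCL initialization keeps failing, that would point even more strongly at a specific operation NCCL performs during communicator setup rather than at general CUDA availability on the vGPU.

# cuda_probe.py - basic CUDA operations without NCCL (debugging sketch)
import torch

assert torch.cuda.is_available(), "CUDA is not visible in this container"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}, {props.total_memory // (1024 ** 2)} MiB, compute capability {props.major}.{props.minor}")

# Device allocation and a kernel launch
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("matmul ok:", tuple(y.shape))

# Pinned (page-locked) host memory and an async host-to-device copy,
# the kind of host-memory setup NCCL also relies on during init
h = torch.empty(1 << 20, pin_memory=True)
d = h.to("cuda", non_blocking=True)
torch.cuda.synchronize()
print("pinned-memory copy ok:", d.numel(), "elements")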

Expected Behavior

NeMo Customizer should initialize successfully and begin fine-tuning the Llama 3.2 1B model on the vGPU without CUDA operation errors.

Additional Context

This is a test Kubernetes cluster running on vSphere VMs with no physical GPU passthrough; only vGPU resources are available. The infrastructure uses VMware vSphere for VM management and NVIDIA GRID vGPU technology for GPU virtualization.

Any guidance on vGPU-specific configurations, VMware vSphere compatibility considerations, or alternative approaches would be greatly appreciated.
Thanks in advance for any help or insights you can provide!