NeMo Customizer "Cuda failure 'operation not supported'" Error in vGPU Environment

Hi forum,

Problem Description

I’m experiencing a “Cuda failure ‘operation not supported’” error when attempting to run NeMo Customizer fine-tuning in a vGPU environment. The error occurs during the NCCL initialization phase, specifically when PyTorch Lightning tries to set up distributed training.
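
To help isolate this outside of NeMo Customizer, here is a minimal standalone script I put together (my own throwaway sketch, not anything shipped in the NeMo container) that reproduces just the failing step: a single-rank NCCL process group plus one broadcast, which is roughly what Lightning’s DDP strategy does during setup. The MASTER_ADDR/MASTER_PORT values are arbitrary placeholders.

# nccl_repro.py - minimal single-rank NCCL init + broadcast (debugging sketch, not NeMo code)
import os
import torch
import torch.distributed as dist

# Single-rank rendezvous; the address/port are arbitrary placeholders
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("NCCL_DEBUG", "INFO")

def main():
    # world_size=1 still goes through full NCCL communicator setup
    dist.init_process_group(backend="nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    # The same kind of call that fails in the Lightning traceback below
    obj = ["probe"]
    dist.broadcast_object_list(obj, src=0)

    # Force a collective on a CUDA tensor to trigger NCCL communicator init
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print("single-rank NCCL init and collectives succeeded:", x.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()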

Environment Details

Hardware & Infrastructure:

  • Virtualization: VMware vSphere-managed VMs with NVIDIA vGPU
  • GPU: NVIDIA GRID A100D-80C (vGPU, 80GB VRAM)
  • Driver Version: 570.133.20 (vGPU driver)
  • CUDA Version: 12.8
  • Platform: On-premise Kubernetes cluster
  • Container Runtime: nvidia runtime class
  • NCCL Version: 2.25.1 (confirmed from error logs: “NCCL version 2.25.1+cuda12.8”)

Software Stack:

  • NeMo Microservices: Version 25.4.0 (deployed via nemo-microservices-helm-chart)
  • NeMo Customizer: Running as a containerized service within NeMo Microservices
  • PyTorch Lightning: Included in the NeMo container image
  • Model: meta/llama-3.2-1b-instruct
  • Training Type: Fine-tuning
  • Deployment Method: Helm chart installation on Kubernetes

Current Configuration

NeMo Customizer Configuration

customizer:
  enabled: true
  customizerConfig:
    models:
      meta/llama-3.2-1b-instruct:
        enabled: true
    training:
      pvc:
        storageClass: "longhorn"
        size: 10Gi
    trainingNetworking:
      - name: CUDA_VISIBLE_DEVICES
        value: "0"
      - name: NCCL_P2P_DISABLE
        value: "1"
      - name: NCCL_SHM_DISABLE
        value: "1"
      - name: NCCL_LAUNCH_MODE
        value: "GROUP"
      - name: NCCL_IB_DISABLE
        value: "1"
      - name: NCCL_IBEXT_DISABLE
        value: "1"
      - name: NCCL_SOCKET_IFNAME
        value: "lo"
      - name: NCCL_DEBUG
        value: INFO
      - name: UCX_TLS
        value: tcp
      - name: UCX_NET_DEVICES
        value: eth0
  modelsStorage:
    size: 50Gi
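
As a sanity check on the configuration above, I also run a small script inside the training worker container (again, my own debugging sketch, not part of NeMo) to confirm that the trainingNetworking entries actually show up as environment variables and that CUDA is visible to PyTorch:

# env_check.py - verify NCCL/UCX env vars and basic CUDA visibility (debugging sketch)
import os
import torch

VARS = [
    "CUDA_VISIBLE_DEVICES", "NCCL_P2P_DISABLE", "NCCL_SHM_DISABLE",
    "NCCL_LAUNCH_MODE", "NCCL_IB_DISABLE", "NCCL_IBEXT_DISABLE",
    "NCCL_SOCKET_IFNAME", "NCCL_DEBUG", "UCX_TLS", "UCX_NET_DEVICES",
]

# Print each expected variable (or mark it as unset)
for name in VARS:
    print(f"{name}={os.environ.get(name, '<unset>')}")

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("nccl:", torch.cuda.nccl.version())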

Error Logs

Primary Error

NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'

Detailed NCCL Debug Output

cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Bootstrap: Using eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO cudaDriverVersion 12080
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'

Key Stack Trace Points

The error occurs during PyTorch Lightning’s distributed setup:

torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3402, in broadcast_object_list
    broadcast(object_sizes_tensor, src=global_src, group=group)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast
    work = group.broadcast([tensor], opts)

Complete Pod Logs

For additional debugging information, I’ve attached the complete pod log file:
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0_main.log (39.3 KB)
This contains the full training job execution logs including all NCCL output and error details.

nvidia-smi Output

Thu Jul  3 07:52:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-80C                 On  |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0            N/A  /  N/A  |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Questions

  1. Is NeMo Customizer officially supported in vGPU environments? The error suggests a CUDA operation is not supported, which might be a vGPU limitation.

  2. How can I identify which specific CUDA operation is failing? The error message only shows “operation not supported” but doesn’t specify which CUDA API call is causing the issue.

  3. Can NeMo Customizer be configured to run in single-GPU mode without NCCL? Since this is a single-vGPU setup, distributed training isn’t necessary. (See the plain PyTorch Lightning sketch after this list for what I mean by this.)

  4. What are the recommended environment variables for vGPU compatibility? Are there additional CUDA/NCCL settings needed for virtualized GPU environments?
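
To make question 3 concrete, below is roughly what “single GPU, no NCCL” looks like in plain PyTorch Lightning: with devices=1 and no DDP strategy, Lightning uses its single-device strategy and torch.distributed/NCCL is never initialized. The DemoModule here is a hypothetical stand-in of my own, not the NeMo Customizer training module, and I don’t know whether Customizer exposes an equivalent option; that is essentially what I’m asking.

# single_device.py - plain Lightning single-GPU run with no DDP/NCCL (hypothetical sketch)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl

class DemoModule(pl.LightningModule):
    # Trivial stand-in model, NOT the NeMo Customizer training module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=16)

    # devices=1 with no explicit DDP strategy -> single-device strategy,
    # so torch.distributed (and therefore NCCL) is never initialized
    trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=1, logger=False)
    trainer.fit(DemoModule(), loader)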

Attempted Solutions

I’ve tried a comprehensive NCCL configuration with every vGPU-compatible setting I could find:

Current Environment Variables

env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"
  - name: NCCL_P2P_DISABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "1"
  - name: NCCL_LAUNCH_MODE
    value: "GROUP"
  - name: NCCL_IB_DISABLE
    value: "1"
  - name: NCCL_IBEXT_DISABLE
    value: "1"
  - name: NCCL_SOCKET_IFNAME
    value: "lo"
  - name: NCCL_DEBUG
    value: INFO
  - name: UCX_TLS
    value: tcp
  - name: UCX_NET_DEVICES
    value: eth0

Test Results

This comprehensive NCCL configuration covers:

  • All InfiniBand features disabled (NCCL_IB_DISABLE, NCCL_IBEXT_DISABLE)
  • Shared memory disabled (NCCL_SHM_DISABLE)
  • P2P communication disabled (NCCL_P2P_DISABLE)
  • Forced loopback interface (NCCL_SOCKET_IFNAME=lo)
  • Group launch mode (NCCL_LAUNCH_MODE=GROUP)
  • Single GPU visibility (CUDA_VISIBLE_DEVICES=0)

Despite all of this, the error still persists:

cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'

This suggests the issue is not a network-configuration problem, but rather that some underlying CUDA operation is not supported in the vGPU environment.
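
To check that hypothesis, I plan to run a CUDA-only probe like the one below (another sketch of mine, no NCCL involved). If plain allocations, kernel launches, and pinned-memory copies succeed while NCCL initialization keeps failing, that would point even more strongly at a specific operation NCCL performs during communicator setup rather than at general CUDA availability on the vGPU.

# cuda_probe.py - basic CUDA operations without NCCL (debugging sketch)
import torch

assert torch.cuda.is_available(), "CUDA is not visible in this container"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}, {props.total_memory // (1024 ** 2)} MiB, compute capability {props.major}.{props.minor}")

# Device allocation and a kernel launch
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("matmul ok:", tuple(y.shape))

# Pinned (page-locked) host memory and an async host-to-device copy,
# the kind of host-memory setup NCCL also relies on during init
h = torch.empty(1 << 20, pin_memory=True)
d = h.to("cuda", non_blocking=True)
torch.cuda.synchronize()
print("pinned-memory copy ok:", d.numel(), "elements")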

Expected Behavior

NeMo Customizer should initialize successfully and begin fine-tuning the Llama 3.2 1B model on the vGPU without CUDA operation errors.

Additional Context

This is a test Kubernetes cluster running on vSphere VMs with no physical GPU passthrough; only vGPU resources are available. The infrastructure uses VMware vSphere for VM management and NVIDIA GRID vGPU technology for GPU virtualization.

Any guidance on vGPU-specific configurations, VMware vSphere compatibility considerations, or alternative approaches would be greatly appreciated.
Thanks in advance for any help or insights you can provide!