Hi, forum
Problem Description
I’m experiencing a “Cuda failure ‘operation not supported’” error when attempting to run NeMo Customizer fine-tuning in a vGPU environment. The error occurs during the NCCL initialization phase, specifically when PyTorch Lightning tries to set up distributed training.
Environment Details
Hardware & Infrastructure:
- Virtualization: VMware vSphere-managed VMs with vGPU passthrough
- GPU: NVIDIA GRID A100D-80C (vGPU, 80GB VRAM)
- Driver Version: 570.133.20 (vGPU driver)
- CUDA Version: 12.8
- Platform: On-premise Kubernetes cluster
- Container Runtime: nvidia runtime class
- NCCL Version: 2.25.1 (confirmed from error logs: “NCCL version 2.25.1+cuda12.8”)
Software Stack:
- NeMo Microservices: Version 25.4.0 (deployed via nemo-microservices-helm-chart)
- NeMo Customizer: Running as a containerized service within NeMo Microservices
- PyTorch Lightning: Included in the NeMo container image
- Model: meta/llama-3.2-1b-instruct
- Training Type: Fine-tuning
- Deployment Method: Helm chart installation on Kubernetes
Current Configuration
NeMo Customizer Configuration
customizer:
  enabled: true
  customizerConfig:
    models:
      meta/llama-3.2-1b-instruct:
        enabled: true
    training:
      pvc:
        storageClass: "longhorn"
        size: 10Gi
    trainingNetworking:
      - name: CUDA_VISIBLE_DEVICES
        value: "0"
      - name: NCCL_P2P_DISABLE
        value: "1"
      - name: NCCL_SHM_DISABLE
        value: "1"
      - name: NCCL_LAUNCH_MODE
        value: "GROUP"
      - name: NCCL_IB_DISABLE
        value: "1"
      - name: NCCL_IBEXT_DISABLE
        value: "1"
      - name: NCCL_SOCKET_IFNAME
        value: "lo"
      - name: NCCL_DEBUG
        value: INFO
      - name: UCX_TLS
        value: tcp
      - name: UCX_NET_DEVICES
        value: eth0
  modelsStorage:
    size: 50Gi
Error Logs
Primary Error
NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'
Detailed NCCL Debug Output
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Bootstrap: Using eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO cudaDriverVersion 12080
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.10.57<0>
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-p2etilx1cjpdtiwrhppcuy-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'
Key Stack Trace Points
The error occurs during PyTorch Lightning’s distributed setup:
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3402, in broadcast_object_list
broadcast(object_sizes_tensor, src=global_src, group=group)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast
work = group.broadcast([tensor], opts)
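If it helps to reproduce this outside of NeMo and Lightning, the failing path boils down to a single-rank NCCL broadcast. Below is a minimal sketch of what I can run in the same container image; the master address/port are placeholders, and the NCCL_DEBUG override mirrors the job config above:

# minimal_nccl_repro.py: standalone sketch of the broadcast path that fails.
# Single rank, NCCL backend; MASTER_ADDR/MASTER_PORT are placeholders.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# single-rank process group on the NCCL backend
dist.init_process_group(backend="nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

# the same collective that Lightning's strategy.broadcast() ends up issuing
objects = ["log_dir placeholder"]
dist.broadcast_object_list(objects, src=0)

dist.destroy_process_group()
print("NCCL broadcast completed:", objects)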
Complete Pod Logs
For additional debugging information, I’ve attached the complete pod log file:
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0_main.log (39.3 KB)
This contains the full training job execution logs including all NCCL output and error details.
nvidia-smi Output
Thu Jul 3 07:52:22 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 GRID A100D-80C On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Questions
- Is NeMo Customizer officially supported on vGPU environments? The error suggests CUDA operations are not supported, which might be a vGPU limitation.
- How can I identify which specific CUDA operation is failing? The error message only shows “operation not supported” but doesn’t specify which CUDA API call is causing the issue. (A rough idea of how I’d narrow this down is sketched after this list.)
- Can NeMo Customizer be configured to run in single-GPU mode without NCCL? Since this is a single-vGPU setup, distributed training isn’t necessary.
- What are the recommended environment variables for vGPU compatibility? Are there additional CUDA/NCCL settings needed for virtualized GPU environments?
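On the second question, this is the kind of step-by-step probe I have in mind to narrow down which class of CUDA operation the vGPU rejects, run inside the training container. It is only a sketch, and the list of operations is my guess at the sort of thing NCCL exercises during init, not an authoritative breakdown:

# cuda_probe.py: try individual CUDA features one at a time so the first failure
# points at the class of operation the vGPU refuses. The selection is a guess.
import torch

def step(name, fn):
    try:
        fn()
        print(f"[ok]   {name}")
    except Exception as exc:
        print(f"[FAIL] {name}: {exc}")

step("basic device allocation + kernel",
     lambda: (torch.ones(1024, device="cuda") * 2).sum().item())
step("pinned host memory (cudaHostAlloc)",
     lambda: torch.ones(1024).pin_memory())
step("async H2D copy into device memory",
     lambda: torch.empty(1024, device="cuda").copy_(torch.ones(1024).pin_memory(), non_blocking=True))
step("device-to-device copy on the same GPU",
     lambda: torch.ones(1024, device="cuda").clone())
torch.cuda.synchronize()

Beyond that, I plan to rerun the minimal repro from the stack-trace section with NCCL_DEBUG=TRACE for per-call detail, but pointers to a more targeted way of mapping “operation not supported” to a specific CUDA API call would be very welcome.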
Attempted Solutions
I’ve tried a comprehensive NCCL configuration with every vGPU-compatible setting I could find:
Current Environment Variables
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"
  - name: NCCL_P2P_DISABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "1"
  - name: NCCL_LAUNCH_MODE
    value: "GROUP"
  - name: NCCL_IB_DISABLE
    value: "1"
  - name: NCCL_IBEXT_DISABLE
    value: "1"
  - name: NCCL_SOCKET_IFNAME
    value: "lo"
  - name: NCCL_DEBUG
    value: INFO
  - name: UCX_TLS
    value: tcp
  - name: UCX_NET_DEVICES
    value: eth0
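For what it’s worth, the NCCL INFO lines below (“set by environment to …”) show these overrides are being picked up by the worker process. A quick way to double-check from inside the worker container, nothing NeMo-specific:

# dump the NCCL/UCX/CUDA-related environment the training process actually sees
import os

for key in sorted(os.environ):
    if key.startswith(("NCCL_", "UCX_", "CUDA_")):
        print(f"{key}={os.environ[key]}")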
Test Results
Despite a comprehensive NCCL configuration including:
- All InfiniBand features disabled (NCCL_IB_DISABLE, NCCL_IBEXT_DISABLE)
- Shared memory disabled (NCCL_SHM_DISABLE)
- P2P communication disabled (NCCL_P2P_DISABLE)
- Forced loopback interface (NCCL_SOCKET_IFNAME=lo)
- Group launch mode (NCCL_LAUNCH_MODE=GROUP)
- Single GPU visibility (CUDA_VISIBLE_DEVICES=0)
The error still persists:
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] NCCL INFO Using network Socket
cust-fh75d51pen8f7f19rwkvwx-training-job-worker-0:1698:1999 [0] init.cc:423 NCCL WARN Cuda failure 'operation not supported'
This suggests the issue is not related to network configuration but rather a fundamental CUDA operation limitation in the vGPU environment.
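One additional check that should separate the networking layer from the CUDA layer: run the same single-rank broadcast from the earlier sketch on the gloo backend, which stays on the CPU. If that succeeds while the NCCL variant fails, it would confirm the rendezvous and socket setup are fine and the failure sits entirely in NCCL’s CUDA calls (again just a sketch, with placeholder address/port):

# gloo_check.py: same broadcast as the NCCL sketch, but on the CPU-only gloo backend.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder, as in the NCCL sketch
os.environ.setdefault("MASTER_PORT", "29501")      # placeholder port

dist.init_process_group(backend="gloo", rank=0, world_size=1)
objects = ["log_dir placeholder"]
dist.broadcast_object_list(objects, src=0)
dist.destroy_process_group()
print("gloo broadcast completed:", objects)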
Expected Behavior
NeMo Customizer should initialize successfully and begin fine-tuning the Llama 3.2 1B model on the vGPU without CUDA operation errors.
Additional Context
This is a test Kubernetes cluster running on vSphere VMs with no physical GPUs — only vGPU resources are available. The infrastructure uses VMware vSphere for VM management with NVIDIA GRID vGPU technology for GPU virtualization.
Any guidance on vGPU-specific configurations, VMware vSphere compatibility considerations, or alternative approaches would be greatly appreciated.
Thanks in advance for any help or insights you can provide!