Image Classification PyTorch Training Error

Please provide the following information when requesting support.

• GPU: NVIDIA RTX 2000 Ada Generation Laptop GPU
• Network: Image Classification (classification_pyt)
• Docker Image: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt

Training seems to be using more VRAM than my GPU has, and I am not seeing a way to reduce the batch size. From the output, it looks like it is using a batch size of 8. I have also tried reducing the batch size to 1 by changing the “samples_per_gpu” parameter, since it seems to correspond to the batch size, but it still gives the same error.

I am training using the TAO launcher command:

tao model classification_pyt train -e $SPECS_DIR/spec_pyt.yaml

I am getting the following error:

2024-08-27 20:33:16,165 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-08-27 20:33:16,392 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-08-27 20:33:17,077 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-08-28 00:33:24,781 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
Train results will be saved at: /workspace/tao-experiments/classification_pyt/output
08/28 00:33:34 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 49
    GPU 0: NVIDIA RTX 2000 Ada Generation Laptop GPU
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.131
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
    PyTorch compiling details: PyTorch built with:
  - GCC 11.2
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
  - CuDNN 90.1
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

    TorchVision: 0.18.0a0
    OpenCV: 4.7.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 49
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 1
------------------------------------------------------------

08/28 00:33:34 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
    dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
    mean=[
        123.675,
        116.28,
        103.53,
    ],
    num_classes=5,
    std=[
        58.395,
        57.12,
        57.375,
    ],
    to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='TaoTextLoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
    backbone=dict(
        drop_path=0.1,
        freeze=False,
        init_cfg=None,
        pretrained='',
        type='fan_small_12_p4_hybrid'),
    head=dict(
        binary=False,
        head_init_scale=1,
        in_channels=384,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
        num_classes=5,
        type='TAOLinearClsHead'),
    neck=None,
    train_cfg=dict(augments=None),
    type='ImageClassifier')
optim_wrapper = dict(
    optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
    paramwise_cfg=None)
param_scheduler = [
    dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes=None,
        data_prefix='/workspace/tao-experiments/test',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1)
train_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        classes=None,
        data_prefix='/workspace/tao-experiments/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='RandomResizedCrop'),
            dict(direction='horizontal', prob=0.5, type='RandomFlip'),
            dict(
                brightness=0.4,
                contrast=0.4,
                saturation=0.4,
                type='ColorJitter'),
            dict(erase_prob=0.3, type='RandomErasing'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes=None,
        data_prefix='/workspace/tao-experiments/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/workspace/tao-experiments/classification_pyt/output'

08/28 00:33:34 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 73, in run_experiment
    runner = Runner.from_cfg(train_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 462, in from_cfg
    runner = cls(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 431, in __init__
    self.model = self.wrap_model(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 898, in wrap_model
    model = MMDistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 93, in __init__
    super().__init__(module=module, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA host alloc 2147483648 bytes

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E0828 00:33:37.994000 140492393055360 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 363) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_00:33:37
  host      : 5ba3b212b4af
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 363)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you debug inside the docker?
Steps:

  1. Open a terminal:
    $ tao model classification_pyt run /bin/bash
  2. Then, inside the docker, run:
    $ python -m torch.utils.collect_env
  3. Then, still inside the docker, run the training again:
    $ export NCCL_DEBUG=INFO
    $ export TORCH_DISTRIBUTED_DEBUG=INFO
    $ export TORCH_SHOW_CPP_STACKTRACES=1
    $ classification_pyt train -e spec_pyt.yaml
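Also, note that “Failed to CUDA host alloc 2147483648 bytes” means NCCL could not allocate 2 GiB of pinned (page-locked) host memory via cudaHostAlloc, which comes from system RAM rather than GPU VRAM. As an extra check inside the docker, you can try to pin 2 GiB of host memory directly with PyTorch. This is only a minimal sketch of mine, not a TAO command:

import torch

# Try to pin the same amount of host memory that NCCL fails to allocate (2 GiB).
nbytes = 2147483648
try:
    buf = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
    print("Pinned 2 GiB of host memory successfully")
except RuntimeError as err:
    print("Pinned host allocation failed:", err)

If this also fails, the problem is with allocating pinned host memory in this environment rather than with the training batch size.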

Running python -m torch.utils.collect_env:

/usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.3.0a0+6ddf5cf85e.nv24.04
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX 2000 Ada Generation Laptop GPU
Nvidia driver version: 538.27
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             28
On-line CPU(s) list:                0-27
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i7-13850HX
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4608.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          672 KiB (14 instances)
L1i cache:                          448 KiB (14 instances)
L2 cache:                           28 MiB (14 instances)
L3 cache:                           30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Unknown: No mitigations
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] flake8-comprehensions==3.14.0
[pip3] mypy==1.8.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] nvidia-tao-pytorch==5.2.0.2272.dev0
[pip3] onnx==1.15.0
[pip3] onnx-graphsurgeon==0.3.27
[pip3] onnx-simplifier==0.4.35
[pip3] onnxoptimizer==0.3.13
[pip3] onnxruntime==1.17.0
[pip3] onnxsim==0.4.35
[pip3] open-clip-torch==2.24.0
[pip3] optree==0.11.0
[pip3] pytorch-lightning==2.2.0
[pip3] pytorch-metric-learning==1.7.1
[pip3] pytorch-msssim==1.0.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+a9bc1a364
[pip3] torch==2.3.0a0+6ddf5cf85e.nv24.4
[pip3] torch-pruning==1.2.2
[pip3] torch-tensorrt==2.3.0a0
[pip3] torchdata==0.7.1a0
[pip3] torchmetrics==0.10.3
[pip3] torchvision==0.18.0a0
[conda] Could not collect

Running classification_pyt train -e spec_pyt.yaml gives the following output:

Train results will be saved at: /workspace/tao-experiments/classification_pyt/output
08/29 11:58:10 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 49
    GPU 0: NVIDIA RTX 2000 Ada Generation Laptop GPU
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.131
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
    PyTorch compiling details: PyTorch built with:
  - GCC 11.2
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
  - CuDNN 90.1
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

    TorchVision: 0.18.0a0
    OpenCV: 4.7.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 49
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 1
------------------------------------------------------------

08/29 11:58:10 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
    dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
    mean=[
        123.675,
        116.28,
        103.53,
    ],
    num_classes=5,
    std=[
        58.395,
        57.12,
        57.375,
    ],
    to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='TaoTextLoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
    backbone=dict(
        drop_path=0.1,
        freeze=False,
        init_cfg=None,
        pretrained='',
        type='fan_small_12_p4_hybrid'),
    head=dict(
        binary=False,
        in_channels=384,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
        num_classes=5,
        type='TAOLinearClsHead'),
    neck=None,
    train_cfg=dict(augments=None),
    type='ImageClassifier')
optim_wrapper = dict(
    optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
    paramwise_cfg=None)
param_scheduler = [
    dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes=None,
        data_prefix='/workspace/tao-experiments/test',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1)
train_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        classes=None,
        data_prefix='/workspace/tao-experiments/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='RandomResizedCrop'),
            dict(direction='horizontal', prob=0.5, type='RandomFlip'),
            dict(
                brightness=0.4,
                contrast=0.4,
                saturation=0.4,
                type='ColorJitter'),
            dict(erase_prob=0.3, type='RandomErasing'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes=None,
        data_prefix='/workspace/tao-experiments/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=4,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/workspace/tao-experiments/classification_pyt/output'

08/29 11:58:10 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
c651a70756c4:776:776 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.4
c651a70756c4:776:883 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
c651a70756c4:776:883 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
c651a70756c4:776:883 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
c651a70756c4:776:883 [0] include/alloc.h:28 NCCL WARN Cuda failure 'out of memory'
[rank0]:[W Module.cpp:159] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 73, in run_experiment
    runner = Runner.from_cfg(train_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 462, in from_cfg
    runner = cls(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 431, in __init__
    self.model = self.wrap_model(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 898, in wrap_model
    model = MMDistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 93, in __init__
    super().__init__(module=module, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA host alloc 2147483648 bytes
Exception raised from getNCCLComm at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#5 c10::DistBackendError::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from :0
#6 c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Device&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#7 c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from ??:0
#8 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from Ops.cpp:0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#10 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#11 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from :0
#12 c10d::ProcessGroup::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from :0
#13 c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::weak_ptr<c10d::Logger> > const&) from ??:0
#14 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#87}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#87}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyObject_CallFunctionObjArgs from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 _PyEval_EvalFrameDefault from ??:0
#19 _PyFunction_Vectorcall from ??:0
#20 _PyEval_EvalFrameDefault from ??:0
#21 PyMethod_New from ??:0
#22 PyObject_Call from ??:0
#23 _PyEval_EvalFrameDefault from ??:0
#24 _PyFunction_Vectorcall from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyObject_FastCallDictTstate from ??:0
#33 _PyStack_AsDict from ??:0
#34 _PyObject_MakeTpCall from ??:0
#35 PyObject_Call from ??:0
#36 _PyEval_EvalFrameDefault from ??:0
#37 PyMethod_New from ??:0
#38 _PyEval_EvalFrameDefault from ??:0
#39 _PyFunction_Vectorcall from ??:0
#40 _PyEval_EvalFrameDefault from ??:0
#41 _PyFunction_Vectorcall from ??:0
#42 _PyEval_EvalFrameDefault from ??:0
#43 _PyFunction_Vectorcall from ??:0
#44 _PyEval_EvalFrameDefault from ??:0
#45 _PyFunction_Vectorcall from ??:0
#46 _PyEval_EvalFrameDefault from ??:0
#47 PyMethod_New from ??:0
#48 _PyEval_EvalFrameDefault from ??:0
#49 _PyFunction_Vectorcall from ??:0
#50 _PyEval_EvalFrameDefault from ??:0
#51 _PyFunction_Vectorcall from ??:0
#52 _PyEval_EvalFrameDefault from ??:0
#53 _PyFunction_Vectorcall from ??:0
#54 _PyEval_EvalFrameDefault from ??:0
#55 _PyFunction_Vectorcall from ??:0
#56 _PyEval_EvalFrameDefault from ??:0
#57 _PyFunction_Vectorcall from ??:0
#58 _PyEval_EvalFrameDefault from ??:0
#59 _PyArg_ParseTuple_SizeT from ??:0
#60 PyEval_EvalCode from ??:0
#61 PyUnicode_Tailmatch from ??:0
#62 PyInit__collections from ??:0
#63 PyUnicode_Tailmatch from ??:0
#64 _PyRun_SimpleFileObject from ??:0
#65 _PyRun_AnyFileObject from ??:0
#66 Py_RunMain from ??:0
#67 Py_BytesMain from ??:0
#68 __libc_init_first from ??:0
#69 __libc_start_main from ??:0
#70 _start from ??:0


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W Module.cpp:159] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
E0829 11:58:15.628000 140256244061312 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 776) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_11:58:15
  host      : c651a70756c4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 776)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-08-29 11:58:15,784 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-08-29 11:58:15,785 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-08-29 11:58:15,785 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-2000-Ada-Generation-Laptop-GPU'], 'success': False, 'time_lapsed': 10} to https://api.tao.ngc.nvidia.com.
[2024-08-29 11:58:16,503 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-08-29 11:58:16,504 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-08-29 11:58:16,504 - TAO Toolkit - root - WARNING] Execution status: FAIL

To narrow this down, please run the following inside the docker to check whether nccl-tests works.
$ export NCCL_DEBUG=INFO
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
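Note that the second command (-g 2) expects two visible GPUs, so it is only meaningful on a multi-GPU setup. To confirm how many devices are visible inside the docker, a quick check (my own snippet, not part of nccl-tests) is:

import torch

# Number of CUDA devices visible to this process; the -g 2 test needs at least 2.
print(torch.cuda.device_count())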

@Morganh
I have done as you asked:

Running the command ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1, I get the following output:

005c099e3a16:5444:5449 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
005c099e3a16:5444:5449 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.

005c099e3a16:5444:5449 [0] include/alloc.h:28 NCCL WARN Cuda failure 'out of memory'

005c099e3a16:5444:5449 [0] include/alloc.h:32 NCCL WARN Failed to CUDA host alloc 2147483648 bytes
005c099e3a16:5444:5449 [0] NCCL INFO init.cc:430 -> 1
005c099e3a16:5444:5449 [0] NCCL INFO init.cc:1401 -> 1
005c099e3a16:5444:5449 [0] NCCL INFO init.cc:1548 -> 1
005c099e3a16:5444:5449 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
005c099e3a16:5444:5444 [0] NCCL INFO group.cc:418 -> 1
005c099e3a16:5444:5444 [0] NCCL INFO group.cc:95 -> 1
005c099e3a16:5444:5444 [0] NCCL INFO init.cc:1892 -> 1
005c099e3a16: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. 005c099e3a16 pid 5444: Test failure common.cu:891

It’s trying to allocate 2 GB (the 2147483648-byte CUDA host alloc) but it’s failing for some odd reason.

Running the second command (-g 2), I get the following:

taotoolkituser@005c099e3a16:/workspace/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
005c099e3a16: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. 005c099e3a16 pid 5454: Test failure common.cu:89

Judging by nvidia-smi, my GPU should have plenty of VRAM for this:

taotoolkituser@005c099e3a16:/workspace/nccl-tests$ nvidia-smi
Thu Aug 29 16:49:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.160                Driver Version: 538.27       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 2000 Ada Gene...    On  | 00000000:01:00.0  On |                  N/A |
| N/A   58C    P8               7W /  55W |   1981MiB /  8188MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

So the nccl-test fails as well. Could you try the following?
root@debug:/workspace# export NCCL_P2P_LEVEL=NVL
root@debug:/workspace# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

@Morganh
I did as you requested; I am getting the following output:

taotoolkituser@005c099e3a16:/workspace/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5455 on 005c099e3a16 device  0 [0x01] NVIDIA RTX 2000 Ada Generation Laptop GPU
005c099e3a16:5455:5455 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
005c099e3a16:5455:5455 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.4
005c099e3a16:5455:5460 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
005c099e3a16:5455:5460 [0] NCCL INFO P2P plugin IBext_v8
005c099e3a16:5455:5460 [0] NCCL INFO NET/IB : No device found.
005c099e3a16:5455:5460 [0] NCCL INFO NET/IB : No device found.
005c099e3a16:5455:5460 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
005c099e3a16:5455:5460 [0] NCCL INFO Using non-device net plugin version 0
005c099e3a16:5455:5460 [0] NCCL INFO Using network Socket
005c099e3a16:5455:5460 [0] NCCL INFO ncclCommInitRank comm 0x5627fa3ca070 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 1000 commId 0x3dc9ac4069848d66 - Init START
005c099e3a16:5455:5460 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
005c099e3a16:5455:5460 [0] NCCL INFO comm 0x5627fa3ca070 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 00/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 01/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 02/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 03/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 04/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 05/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 06/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 07/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 08/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 09/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 10/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 11/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 12/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 13/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 14/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 15/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 16/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 17/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 18/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 19/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 20/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 21/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 22/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 23/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 24/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 25/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 26/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 27/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 28/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 29/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 30/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Channel 31/32 :    0
005c099e3a16:5455:5460 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
005c099e3a16:5455:5460 [0] NCCL INFO P2P Chunksize set to 131072
005c099e3a16:5455:5460 [0] NCCL INFO Connected all rings
005c099e3a16:5455:5460 [0] NCCL INFO Connected all trees
005c099e3a16:5455:5460 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
005c099e3a16:5455:5460 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.

005c099e3a16:5455:5460 [0] include/alloc.h:28 NCCL WARN Cuda failure 'out of memory'

005c099e3a16:5455:5460 [0] include/alloc.h:32 NCCL WARN Failed to CUDA host alloc 2147483648 bytes
005c099e3a16:5455:5460 [0] NCCL INFO init.cc:430 -> 1
005c099e3a16:5455:5460 [0] NCCL INFO init.cc:1401 -> 1
005c099e3a16:5455:5460 [0] NCCL INFO init.cc:1548 -> 1
005c099e3a16:5455:5460 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
005c099e3a16:5455:5455 [0] NCCL INFO group.cc:418 -> 1
005c099e3a16:5455:5455 [0] NCCL INFO group.cc:95 -> 1
005c099e3a16:5455:5455 [0] NCCL INFO init.cc:1892 -> 1
005c099e3a16: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. 005c099e3a16 pid 5455: Test failure common.cu:891

Please open a new terminal and run the following outside the docker.
$ sudo lspci -vvv | grep ACSCtl
$ dmesg | grep IOMMU

Then please follow TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - #27 by Morganh.

After rebooting, please run nccl-test again.

Also, to narrow down further, please run nccl-tests directly on the host instead of inside the tao docker.
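In addition, as a PyTorch-level check (a minimal sketch of mine, not a TAO script), you can try initializing a single-process NCCL group directly inside the docker. This exercises the same NCCL communicator setup that fails when the trainer wraps the model in DistributedDataParallel:

import os
import torch
import torch.distributed as dist

# Defaults for the env:// rendezvous used by init_process_group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Single-rank NCCL group; the first collective creates the NCCL communicator,
# which is where the "CUDA host alloc" failure appears in the training log.
dist.init_process_group(backend="nccl", rank=0, world_size=1)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print("NCCL all_reduce OK:", x.item())
dist.destroy_process_group()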

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
