V100 GPUs not recognised within the container

mahima1 · December 7, 2022, 11:14am

Please provide the following information when requesting support.

• Hardware: V100
• Network Type: speech_to_text_citrinet
• TLT Version: v3.22.05-py3
• How to reproduce the issue?
COMMAND: speech_to_text_citrinet train -e specs/train_citrinet_bpe.yaml -g 1 -k $KEY -r results/train training_ds.manifest_filepath=data/an4_converted/train_manifest.json validation_ds.manifest_filepath=data/an4_converted/test_manifest.json trainer.max_epochs=1 training_ds.num_workers=4 validation_ds.num_workers=4 model.tokenizer.dir=data/an4/tokenizer_spe_unigram_v32

LOG:

[NeMo W 2022-12-07 10:46:16 nemo_logging:349] /home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py:159: UserWarning: 
    'train_citrinet_bpe.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-12-07 10:46:17 tlt_logging:20] Experiment configuration:
    exp_manager:
      explicit_log_dir: results/train
      exp_dir: null
      name: trained-model
      version: null
      use_datetime_version: true
      resume_if_exists: true
      resume_past_end: false
      resume_ignore_no_checkpoint: true
      create_tensorboard_logger: false
      summary_writer_kwargs: null
      create_wandb_logger: false
      wandb_logger_kwargs: null
      create_checkpoint_callback: true
      checkpoint_callback_params:
        filepath: null
        dirpath: null
        filename: null
        monitor: val_loss
        verbose: true
        save_last: true
        save_top_k: 3
        save_weights_only: false
        mode: min
        every_n_epochs: 1
        prefix: null
        postfix: .tlt
        save_best_model: false
        always_save_nemo: false
        save_nemo_on_train_end: true
        model_parallel_size: null
      files_to_copy: null
      log_step_timing: true
      step_timing_kwargs:
        reduction: mean
        sync_cuda: false
        buffer_size: 1
    model:
      preprocessor:
        _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
        normalize: per_feature
        window_size: 0.02
        sample_rate: 16000
        window_stride: 0.01
        window: hann
        features: 64
        n_fft: 512
        frame_splicing: 1
        dither: 1.0e-05
        stft_conv: false
      spec_augment:
        _target_: nemo.collections.asr.modules.SpectrogramAugmentation
        rect_freq: 50
        rect_masks: 0
        rect_time: 120
      encoder:
        _target_: nemo.collections.asr.modules.ConvASREncoder
        feat_in: 64
        activation: relu
        conv_mask: true
        jasper:
        - filters: 128
          repeat: 1
          kernel:
          - 11
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: true
          separable: true
          se: true
          se_context_size: -1
        - filters: 256
          repeat: 1
          kernel:
          - 13
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: true
          separable: true
          se: true
          se_context_size: -1
        - filters: 256
          repeat: 1
          kernel:
          - 15
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: true
          separable: true
          se: true
          se_context_size: -1
        - filters: 256
          repeat: 1
          kernel:
          - 17
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: true
          separable: true
          se: true
          se_context_size: -1
        - filters: 256
          repeat: 1
          kernel:
          - 19
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: true
          separable: true
          se: true
          se_context_size: -1
        - filters: 256
          repeat: 1
          kernel:
          - 21
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: false
          separable: true
          se: true
          se_context_size: -1
        - filters: 1024
          repeat: 1
          kernel:
          - 1
          stride:
          - 1
          dilation:
          - 1
          dropout: 0.0
          residual: false
          separable: true
          se: true
          se_context_size: -1
      decoder:
        _target_: nemo.collections.asr.modules.ConvASRDecoder
        feat_in: 1024
        num_classes: -1
        vocabulary: []
      tokenizer:
        dir: data/an4/tokenizer_spe_unigram_v32
        type: bpe
      ctc_reduction: mean_batch
      log_prediction: true
    trainer:
      logger: false
      checkpoint_callback: false
      callbacks: null
      default_root_dir: null
      gradient_clip_val: 0.0
      process_position: 0
      num_nodes: 1
      num_processes: 1
      gpus: 1
      auto_select_gpus: false
      tpu_cores: null
      log_gpu_memory: null
      progress_bar_refresh_rate: 1
      enable_progress_bar: true
      overfit_batches: 0.0
      track_grad_norm: -1
      check_val_every_n_epoch: 1
      fast_dev_run: false
      accumulate_grad_batches: 1
      max_epochs: 1
      min_epochs: 1
      max_steps: null
      min_steps: null
      limit_train_batches: 1.0
      limit_val_batches: 1.0
      limit_test_batches: 1.0
      val_check_interval: 1.0
      flush_logs_every_n_steps: 100
      log_every_n_steps: 50
      accelerator: ddp
      sync_batchnorm: false
      precision: 32
      weights_summary: full
      weights_save_path: null
      num_sanity_val_steps: 2
      resume_from_checkpoint: null
      profiler: null
      benchmark: false
      deterministic: false
      reload_dataloaders_every_epoch: false
      auto_lr_find: false
      replace_sampler_ddp: true
      detect_anomaly: false
      terminate_on_nan: false
      auto_scale_batch_size: false
      prepare_data_per_node: true
      amp_backend: apex
      amp_level: O0
      plugins: null
      move_metrics_to_cpu: false
      multiple_trainloader_mode: max_size_cycle
      limit_predict_batches: 1.0
      stochastic_weight_avg: false
      gradient_clip_algorithm: norm
      max_time: null
      reload_dataloaders_every_n_epochs: 0
      ipus: null
      devices: null
      strategy: null
      enable_checkpointing: true
      enable_model_summary: true
    training_ds:
      manifest_filepath: data/an4_converted/train_manifest.json
      batch_size: 32
      sample_rate: 16000
      labels: null
      num_workers: 4
      pin_memory: true
      trim_silence: true
      shuffle: true
      max_duration: 16.7
      min_duration: null
      is_tarred: false
      tarred_audio_filepaths: null
      use_start_end_token: false
      shuffle_n: null
      bucketing_strategy: synced_randomized
      bucketing_batch_size: null
    validation_ds:
      manifest_filepath: data/an4_converted/test_manifest.json
      batch_size: 32
      sample_rate: 16000
      labels: null
      num_workers: 4
      pin_memory: true
      trim_silence: true
      shuffle: false
      max_duration: null
      min_duration: null
      is_tarred: false
      tarred_audio_filepaths: null
      use_start_end_token: false
      shuffle_n: null
      bucketing_strategy: synced_randomized
      bucketing_batch_size: null
    optim:
      name: adam
      lr: 0.1
      betas:
      - 0.9
      - 0.999
      weight_decay: 0.0001
      sched:
        name: CosineAnnealing
        warmup_steps: null
        warmup_ratio: 0.05
        min_lr: 1.0e-06
        last_epoch: -1
    encryption_key: '******'
    tlt_checkpoint_interval: 1
    early_stopping: null
    
Error executing job with overrides: ['exp_manager.explicit_log_dir=results/train', 'trainer.gpus=1', 'encryption_key=tlt_encode', 'training_ds.manifest_filepath=data/an4_converted/train_manifest.json', 'validation_ds.manifest_filepath=data/an4_converted/test_manifest.json', 'trainer.max_epochs=1', 'training_ds.num_workers=4', 'validation_ds.num_workers=4', 'model.tokenizer.dir=data/an4/tokenizer_spe_unigram_v32']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py", line 112, in main
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
    return fn(self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 426, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1538, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 89, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 151, in _sanitize_gpu_ids
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py", line 159, in <module>
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError

Hi,

I am using the nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 image to train an ASR model. The container works without any issue on a system with A100 GPUs, but on a system with V100 GPUs I get the error shown above. On the V100 system, the nvidia-smi command works outside the container, but inside the container it produces no output. Outside the container it reports the following:
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7

Please let me know the steps to be taken to enable the GPUs within the container.
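
In case it helps with narrowing this down, here is a minimal sketch of a check that could be run both on the host and inside the container to compare what each side sees. It is only an illustration, not part of TAO: the function name collect_gpu_visibility is made up for this post, and it assumes that nvidia-smi -L, the /dev/nvidia* device nodes, and the NVIDIA_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES environment variables are the relevant things to look at. PyTorch is queried only if it can be imported.

```python
import glob
import os
import shutil
import subprocess


def collect_gpu_visibility():
    """Gather basic facts about GPU visibility in the current environment.

    Hypothetical helper for illustration; it only reads state and changes nothing.
    """
    report = {
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi_gpus": [],
        "torch_device_count": None,
    }

    # "nvidia-smi -L" prints one line per GPU that the driver exposes.
    if report["nvidia_smi_path"]:
        try:
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],
                capture_output=True,
                text=True,
                timeout=30,
                check=False,
            )
            report["nvidia_smi_gpus"] = [
                line for line in result.stdout.splitlines() if line.strip()
            ]
        except (OSError, subprocess.SubprocessError):
            pass  # leave the list empty if the tool cannot be run

    # PyTorch Lightning's GPU check depends on what PyTorch can see.
    try:
        import torch

        report["torch_device_count"] = torch.cuda.device_count()
    except Exception:
        pass  # PyTorch missing or unable to load; leave the count as None

    return report


if __name__ == "__main__":
    for key, value in collect_gpu_visibility().items():
        print(f"{key}: {value}")
```

If the host lists the V100s but the container shows no device nodes and an empty GPU list, that would suggest the GPUs are not being passed into the container at all, rather than a problem in the training script itself.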
