Please provide the following information when requesting support.
• Hardware: V100
• Network Type: speech_to_text_citrinet
• TLT Version: v3.22.05-py3
• How to reproduce the issue?
COMMAND: speech_to_text_citrinet train -e specs/train_citrinet_bpe.yaml -g 1 -k $KEY -r results/train training_ds.manifest_filepath=data/an4_converted/train_manifest.json validation_ds.manifest_filepath=data/an4_converted/test_manifest.json trainer.max_epochs=1 training_ds.num_workers=4 validation_ds.num_workers=4 model.tokenizer.dir=data/an4/tokenizer_spe_unigram_v32
LOG:
[NeMo W 2022-12-07 10:46:16 nemo_logging:349] /home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py:159: UserWarning:
'train_citrinet_bpe.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
[NeMo I 2022-12-07 10:46:17 tlt_logging:20] Experiment configuration:
exp_manager:
  explicit_log_dir: results/train
  exp_dir: null
  name: trained-model
  version: null
  use_datetime_version: true
  resume_if_exists: true
  resume_past_end: false
  resume_ignore_no_checkpoint: true
  create_tensorboard_logger: false
  summary_writer_kwargs: null
  create_wandb_logger: false
  wandb_logger_kwargs: null
  create_checkpoint_callback: true
  checkpoint_callback_params:
    filepath: null
    dirpath: null
    filename: null
    monitor: val_loss
    verbose: true
    save_last: true
    save_top_k: 3
    save_weights_only: false
    mode: min
    every_n_epochs: 1
    prefix: null
    postfix: .tlt
    save_best_model: false
    always_save_nemo: false
    save_nemo_on_train_end: true
    model_parallel_size: null
  files_to_copy: null
  log_step_timing: true
  step_timing_kwargs:
    reduction: mean
    sync_cuda: false
    buffer_size: 1
model:
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: per_feature
    window_size: 0.02
    sample_rate: 16000
    window_stride: 0.01
    window: hann
    features: 64
    n_fft: 512
    frame_splicing: 1
    dither: 1.0e-05
    stft_conv: false
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    rect_freq: 50
    rect_masks: 0
    rect_time: 120
  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: 64
    activation: relu
    conv_mask: true
    jasper:
    - filters: 128
      repeat: 1
      kernel:
      - 11
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 256
      repeat: 1
      kernel:
      - 13
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 256
      repeat: 1
      kernel:
      - 15
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 256
      repeat: 1
      kernel:
      - 17
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 256
      repeat: 1
      kernel:
      - 19
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 256
      repeat: 1
      kernel:
      - 21
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: false
      separable: true
      se: true
      se_context_size: -1
    - filters: 1024
      repeat: 1
      kernel:
      - 1
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: false
      separable: true
      se: true
      se_context_size: -1
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: 1024
    num_classes: -1
    vocabulary: []
  tokenizer:
    dir: data/an4/tokenizer_spe_unigram_v32
    type: bpe
  ctc_reduction: mean_batch
  log_prediction: true
trainer:
  logger: false
  checkpoint_callback: false
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: 1
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: ddp
  sync_batchnorm: false
  precision: 32
  weights_summary: full
  weights_save_path: null
  num_sanity_val_steps: 2
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  amp_backend: apex
  amp_level: O0
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
  limit_predict_batches: 1.0
  stochastic_weight_avg: false
  gradient_clip_algorithm: norm
  max_time: null
  reload_dataloaders_every_n_epochs: 0
  ipus: null
  devices: null
  strategy: null
  enable_checkpointing: true
  enable_model_summary: true
training_ds:
  manifest_filepath: data/an4_converted/train_manifest.json
  batch_size: 32
  sample_rate: 16000
  labels: null
  num_workers: 4
  pin_memory: true
  trim_silence: true
  shuffle: true
  max_duration: 16.7
  min_duration: null
  is_tarred: false
  tarred_audio_filepaths: null
  use_start_end_token: false
  shuffle_n: null
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
validation_ds:
  manifest_filepath: data/an4_converted/test_manifest.json
  batch_size: 32
  sample_rate: 16000
  labels: null
  num_workers: 4
  pin_memory: true
  trim_silence: true
  shuffle: false
  max_duration: null
  min_duration: null
  is_tarred: false
  tarred_audio_filepaths: null
  use_start_end_token: false
  shuffle_n: null
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
optim:
  name: adam
  lr: 0.1
  betas:
  - 0.9
  - 0.999
  weight_decay: 0.0001
  sched:
    name: CosineAnnealing
    warmup_steps: null
    warmup_ratio: 0.05
    min_lr: 1.0e-06
    last_epoch: -1
encryption_key: '******'
tlt_checkpoint_interval: 1
early_stopping: null
Error executing job with overrides: ['exp_manager.explicit_log_dir=results/train', 'trainer.gpus=1', 'encryption_key=tlt_encode', 'training_ds.manifest_filepath=data/an4_converted/train_manifest.json', 'validation_ds.manifest_filepath=data/an4_converted/test_manifest.json', 'trainer.max_epochs=1', 'training_ds.num_workers=4', 'validation_ds.num_workers=4', 'model.tokenizer.dir=data/an4/tokenizer_spe_unigram_v32']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py", line 112, in main
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
    return fn(self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 426, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1538, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 89, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 151, in _sanitize_gpu_ids
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
But your machine only has: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/conv_ai/asr/speech_to_text_ctc/scripts/train.py", line 159, in <module>
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Hi,
I am using the nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 image to train an ASR model. The container works without any issue on a system with A100 GPUs, but on a system with V100 GPUs I get the error shown above. The nvidia-smi command works outside the container, but inside the container it produces no output.
Outside the container, nvidia-smi reports the following:
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
Please let me know what steps are needed to make the GPUs visible inside the container.
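
In case it helps with triage, here is a minimal sketch of a check that could be run inside the container to see whether any GPU is exposed to it at all. It is only a diagnostic aid, not part of TAO: the function name gpu_visibility_report is made up for illustration, and it assumes the usual NVIDIA container conventions (the NVIDIA_VISIBLE_DEVICES environment variable, /dev/nvidia* device nodes, and nvidia-smi -L for listing GPUs).

```python
import glob
import os
import shutil
import subprocess


def gpu_visibility_report(dev_glob="/dev/nvidia*"):
    """Collect basic hints about whether GPUs are exposed to this environment.

    Hypothetical helper for illustration only; it does not touch TAO or PyTorch.
    """
    report = {
        # Usually set by the NVIDIA container runtime when GPUs are requested (assumption).
        "nvidia_visible_devices": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if any are mounted.
        "device_nodes": sorted(glob.glob(dev_glob)),
        # Full path of nvidia-smi, or None if it is not on PATH.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        # One entry per GPU line printed by "nvidia-smi -L".
        "nvidia_smi_gpus": [],
    }
    if report["nvidia_smi_path"]:
        try:
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],
                capture_output=True,
                text=True,
                timeout=30,
                check=False,
            )
            report["nvidia_smi_gpus"] = [
                line for line in result.stdout.splitlines() if line.startswith("GPU ")
            ]
        except (OSError, subprocess.SubprocessError):
            # Leave the GPU list empty if nvidia-smi cannot be executed or times out.
            pass
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If device_nodes and nvidia_smi_gpus both come back empty inside the container while nvidia-smi works on the host, that would suggest the GPUs are not being passed through to the container, rather than a problem in the training spec itself.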