Tao Text Classification Evaluate failing

meravleen · August 26, 2021, 10:54am

I am running the evaluate command for TAO text classification after fine-tuning a pretrained model.

Reference: https://developer.nvidia.com/blog/building-and-deploying-conversational-ai-models-using-tao-toolkit/

I get the following error log:
(taoenv) ubuntu@ip-172-31-14-240:~/jarvis_quickstart_v1.0.0-b.1$ tao text_classification evaluate -e /specs/nlp/text_classification/evaluate.yaml -r /results/nlp/text_classification/evaluate -m /results/nlp/text_classification/train/checkpoints/trained_model.tlt -g 1 -k $KEY test_ds.file_path=/data/sst2/test.tsv test_ds.batch_size=32
2021-08-26 09:57:07,576 [INFO] root: Registry: ['nvcr.io']
2021-08-26 09:57:07,896 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2021-08-26 09:57:16 experimental:27] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-08-26 09:57:20 experimental:27] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2021-08-26 09:57:21 tlt_logging:20] Experiment configuration:
restore_from: /results/nlp/text_classification/train/checkpoints/trained_model.tlt
exp_manager:
  explicit_log_dir: /results/nlp/text_classification/evaluate
  exp_dir: null
  name: null
  version: null
  use_datetime_version: true
  resume_if_exists: false
  resume_past_end: false
  resume_ignore_no_checkpoint: false
  create_tensorboard_logger: false
  summary_writer_kwargs: null
  create_wandb_logger: false
  wandb_logger_kwargs: null
  create_checkpoint_callback: false
  checkpoint_callback_params:
    filepath: null
    monitor: val_loss
    verbose: true
    save_last: true
    save_top_k: 3
    save_weights_only: false
    mode: auto
    period: 1
    prefix: null
    postfix: .nemo
    save_best_model: false
  files_to_copy: null
trainer:
  logger: false
  checkpoint_callback: false
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: 1
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1000
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: ddp
  sync_batchnorm: false
  precision: 32
  weights_summary: full
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  amp_backend: native
  amp_level: O2
test_ds:
  file_path: /data/sst2/test.tsv
  batch_size: 32
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
encryption_key: '********'

GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2021-08-26 09:57:21 exp_manager:380] Exp_manager is logging to /results/nlp/text_classification/evaluate, but it already exists.
[NeMo I 2021-08-26 09:57:21 exp_manager:194] Experiments will be logged at /results/nlp/text_classification/evaluate
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 127, in run_job
    ret.return_value = task_function(task_cfg)
  File "/tlt-nemo/nlp/text_classification/scripts/evaluate.py", line 88, in main
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 473, in restore_from
    raise FileNotFoundError(f"Can't find {restore_path}")
FileNotFoundError: Can't find /results/nlp/text_classification/train/checkpoints/trained_model.tlt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tlt-nemo/nlp/text_classification/scripts/evaluate.py", line 101, in <module>
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py", line 98, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 237, in run_and_report
    assert mdl is not None
AssertionError
2021-08-26 09:57:22,889 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
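
The first run above stops with `FileNotFoundError` for `trained_model.tlt`, while the second run below passes `trained-model.tlt` and gets past model loading, so the two commands differ only in `_` versus `-` in the checkpoint name. A minimal sketch for catching this kind of file name mismatch before launching a job follows. The helper `find_checkpoint` is hypothetical and not part of TAO or NeMo, and the example directory is an assumption: the `/results/...` paths in the commands are container paths, so the check has to be pointed at wherever that directory is actually visible.

```python
import difflib
from pathlib import Path


def find_checkpoint(path_str):
    """Return (exists, suggestions) for a checkpoint path.

    `suggestions` lists similarly named files in the same directory,
    which helps spot typos such as '_' versus '-'.
    """
    path = Path(path_str)
    if path.is_file():
        return True, []
    if not path.parent.is_dir():
        # Nothing to compare against if the directory itself is missing.
        return False, []
    names = [p.name for p in path.parent.iterdir() if p.is_file()]
    close = difflib.get_close_matches(path.name, names, n=3, cutoff=0.6)
    return False, close


if __name__ == "__main__":
    # Hypothetical location; replace with the directory that holds the
    # checkpoints on the machine where this script runs.
    exists, suggestions = find_checkpoint(
        "results/nlp/text_classification/train/checkpoints/trained_model.tlt"
    )
    print("exists:", exists, "similar names:", suggestions)
```

If the file is missing but a near match such as `trained-model.tlt` is present, the suggestion list points at the name to pass with `-m`.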
(taoenv) ubuntu@ip-172-31-14-240:~/jarvis_quickstart_v1.0.0-b.1$ tao text_classification evaluate -e /specs/nlp/text_classification/evaluate.yaml -r /results/nlp/text_classification/evaluate -m /results/nlp/text_classification/train/checkpoints/trained-model.tlt -g 1 -k $KEY test_ds.file_path=/data/sst2/test.tsv test_ds.batch_size=32
2021-08-26 09:59:25,123 [INFO] root: Registry: ['nvcr.io']
2021-08-26 09:59:25,225 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2021-08-26 09:59:28 experimental:27] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-08-26 09:59:31 experimental:27] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2021-08-26 09:59:32 tlt_logging:20] Experiment configuration:
restore_from: /results/nlp/text_classification/train/checkpoints/trained-model.tlt
exp_manager:
  explicit_log_dir: /results/nlp/text_classification/evaluate
  exp_dir: null
  name: null
  version: null
  use_datetime_version: true
  resume_if_exists: false
  resume_past_end: false
  resume_ignore_no_checkpoint: false
  create_tensorboard_logger: false
  summary_writer_kwargs: null
  create_wandb_logger: false
  wandb_logger_kwargs: null
  create_checkpoint_callback: false
  checkpoint_callback_params:
    filepath: null
    monitor: val_loss
    verbose: true
    save_last: true
    save_top_k: 3
    save_weights_only: false
    mode: auto
    period: 1
    prefix: null
    postfix: .nemo
    save_best_model: false
  files_to_copy: null
trainer:
  logger: false
  checkpoint_callback: false
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: 1
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1000
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: ddp
  sync_batchnorm: false
  precision: 32
  weights_summary: full
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  amp_backend: native
  amp_level: O2
test_ds:
  file_path: /data/sst2/test.tsv
  batch_size: 32
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
encryption_key: '***'

GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2021-08-26 09:59:32 exp_manager:380] Exp_manager is logging to /results/nlp/text_classification/evaluate, but it already exists.
[NeMo I 2021-08-26 09:59:32 exp_manager:194] Experiments will be logged at /results/nlp/text_classification/evaluate
[NeMo W 2021-08-26 09:59:36 modelPT:193] Using /tmp/tmpzsinnlml/tokenizer.vocab_file instead of tokenizer.vocab_file.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
[NeMo W 2021-08-26 09:59:36 modelPT:1202] World size can only be set by PyTorch Lightning Trainer.
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:120] Read 1822 examples from /data/sst2/test.tsv.
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:238] *** Example ***
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:239] example 0: ['1784', 'full', 'frontal', 'is', 'the', 'antidote', 'for', 'soderbergh', 'fans', 'who', 'think', 'he', "'s", 'gone', 'too', 'commercial', 'since', 'his', 'two', 'oscar', 'nominated', 'films', 'in']
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:240] subtokens: [CLS] 1784 full frontal is the anti ##dote for so ##der ##berg ##h fans who think he ' s gone too commercial since his two oscar nominated films in [SEP]
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:241] input_ids: 101 16496 2440 19124 2003 1996 3424 23681 2005 2061 4063 4059 2232 4599 2040 2228 2002 1005 1055 2908 2205 3293 2144 2010 2048 7436 4222 3152 1999 102
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:242] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:243] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2021-08-26 09:59:48 text_classification_dataset:244] label: 2000
[NeMo I 2021-08-26 09:59:48 data_preprocessing:295] Some stats of the lengths of the sequences:
[NeMo I 2021-08-26 09:59:48 data_preprocessing:297] Min: 30 | Max: 30 | Mean: 30.0 | Median: 30.0
[NeMo I 2021-08-26 09:59:48 data_preprocessing:303] 75 percentile: 30.00
[NeMo I 2021-08-26 09:59:48 data_preprocessing:304] 99 percentile: 30.00
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Testing: 0it [00:00, ?it/s]/tmp/pip-req-build-_tx3iysr/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 127, in run_job
    ret.return_value = task_function(task_cfg)
  File "/tlt-nemo/nlp/text_classification/scripts/evaluate.py", line 94, in main
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 819, in __test_given_model
    results = self.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 472, in fit
    results = self.accelerator_backend.train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 307, in ddp_train
    results = self.train_or_test()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 67, in train_or_test
    results = self.trainer.run_test()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 661, in run_test
    eval_loop_results, _ = self.run_evaluation(test_mode=True)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 175, in evaluation_step
    output = self.trainer.accelerator_backend.test_step(args)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 164, in test_step
    return self._step(args)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 172, in _step
    output = self.trainer.model(*args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 182, in forward
    output = self.module.test_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/text_classification/text_classification_model.py", line 178, in test_step
    return self.validation_step(batch, batch_idx)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/text_classification/text_classification_model.py", line 145, in validation_step
    tp, fn, fp, _ = self.classification_report(preds, labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 154, in forward
    self.update(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 200, in wrapped_func
    return update(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/metrics/classification_report.py", line 94, in update
    TP.append((label_predicted == current_label)[label_predicted].sum())
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tlt-nemo/nlp/text_classification/scripts/evaluate.py", line 101, in <module>
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py", line 98, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 237, in run_and_report
    assert mdl is not None
AssertionError
Exception ignored in: <function tqdm.__del__ at 0x7f54b2d8f550>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1150, in __del__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1363, in close
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1542, in display
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1153, in __repr__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1503, in format_dict
TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at …/c10/cuda/CUDACachingAllocator.cpp:716 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f5504e7e44c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f5504e450b4 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x987 (0x7f5504ebcf97 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x5c (0x7f5504e647dc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x8d2572 (0x7f554ccef572 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x8d2605 (0x7f554ccef605 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #32: __libc_start_main + 0xf3 (0x7f5577d390b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

2021-08-26 09:59:51,373 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
(taoenv) ubuntu@ip-172-31-14-240:~/jarvis_quickstart_v1.0.0-b.1$ python3 --version
Python 3.6.9

Here is the output of nvidia-smi

meravleen · August 26, 2021, 11:11am

Why is this happening, and how can I fix it?
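
One clue in the second log is that the dataset reader reports `label: 2000` for example 0, and the CUDA assertion `t >= 0 && t < n_classes` fires when a target class index falls outside the range the loss expects. That suggests the label column of `/data/sst2/test.tsv` may not hold valid class indices. A rough sketch for checking this follows, assuming one example per line with the label as the last tab-separated field, and assuming two classes as in binary SST-2 sentiment. `find_bad_labels` is a hypothetical helper, not a TAO or NeMo function, and the sample lines are made up for illustration.

```python
def find_bad_labels(lines, num_classes):
    """Return (line_number, raw_label) pairs whose label is not an
    integer in the range [0, num_classes)."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: the label is the last tab-separated field.
        raw = line.rsplit("\t", 1)[-1].strip()
        try:
            ok = 0 <= int(raw) < num_classes
        except ValueError:
            ok = False  # header text or other non-integer content
        if not ok:
            bad.append((line_no, raw))
    return bad


if __name__ == "__main__":
    # Made-up sample rows; in practice, pass an open file object for
    # the data file instead of this list.
    sample = [
        "a gripping movie\t1\n",
        "dull and lifeless\t0\n",
        "some sentence with a stray number\t2000\n",
        "sentence\tlabel\n",
    ]
    print(find_bad_labels(sample, num_classes=2))
    # prints [(3, '2000'), (4, 'label')]
```

Any rows reported here would need to be fixed or dropped before evaluation. If every row is reported, the file is probably in a different layout than the one assumed above, for example an unlabeled test split or a file with a header row.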

tvarshney · August 26, 2021, 6:39pm

Hi @meravleen,

We have a dedicated forum for TAO which would be a better place to look for answers to this specific issue: TAO Toolkit - NVIDIA Developer Forums

nadeemm · August 27, 2021, 5:18pm

I shall move this post to the TAO Toolkit forums.

Morganh · August 31, 2021, 3:08am

This is a duplicate of the topic "Tao Text Classification Evaluate failing - #15 by meravleen".