Description
I want to evaluate the visual_changenet classification model on our own data with the command below, but the training run does not complete; the log is attached. Could you help clarify what the problem is (invalid multinomial distribution (sum of probabilities <= 0))? Thanks a lot.
!tao model visual_changenet evaluate \
    -e $SPECS_DIR/experiment_classify.yaml \
    evaluate.checkpoint=$RESULTS_DIR/train/changenet_classify.pth
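For context on the error string: it is raised by torch.multinomial when the sampling weights it is given sum to zero. The short sketch below is my own illustration (not TAO source code) of how the same message appears as soon as a WeightedRandomSampler is built from per-sample weights that are all zero:

# Minimal sketch, my own illustration only: a WeightedRandomSampler whose
# per-sample weights are all zero raises the same RuntimeError, because
# torch.multinomial has no positive probability mass to draw from.
import torch
from torch.utils.data import WeightedRandomSampler

weights = torch.zeros(8, dtype=torch.double)  # e.g. no sample flagged as defective
sampler = WeightedRandomSampler(weights, num_samples=8, replacement=True)

try:
    next(iter(sampler))  # the multinomial draw happens on the first iteration
except RuntimeError as err:
    print(err)  # invalid multinomial distribution (sum of probabilities <= 0)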
Environment
Ubuntu 20.04
TAO Toolkit 5.1.0
Log
Train model
2023-11-15 02:49:13,538 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-11-15 02:49:13,811 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.1.0-pyt
2023-11-15 02:49:13,847 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 262:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/wistron/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2023-11-15 02:49:13,847 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
INFO: Created a temporary directory at /tmp/tmpw7ylm3pa
INFO: Writing /tmp/tmpw7ylm3pa/_remote_module_non_scriptable.py
INFO: generated new fontManager
sys:1: UserWarning:
'experiment_classify.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
'experiment_classify.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Experiment configuration:
model:
  backbone:
    type: fan_small_12_p4_hybrid
    feat_downsample: false
    pretrained_backbone_path: null
  decode_head:
    in_channels:
    - 128
    - 256
    - 384
    - 384
    in_index:
    - 0
    - 1
    - 2
    - 3
    feature_strides:
    - 4
    - 8
    - 16
    - 16
    align_corners: false
    decoder_params:
      embed_dim: 256
  classify:
    train_margin_euclid: 2.0
    eval_margin: 0.005
    embedding_vectors: 5
    embed_dec: 30
    learnable_difference_modules: 4
    difference_module: euclidean
dataset:
  segment:
    root_dir: ???
    label_transform: norm
    data_name: LEVIR
    dataset: CNDataset
    multi_scale_train: true
    multi_scale_infer: false
    num_classes: 2
    img_size: 256
    batch_size: 8
    workers: 2
    shuffle: true
    image_folder_name: A
    change_image_folder_name: B
    list_folder_name: list
    annotation_folder_name: label
    augmentation:
      random_flip:
        vflip_probability: 0.5
        hflip_probability: 0.5
        enable: true
      random_rotate:
        rotate_probability: 0.5
        angle_list:
        - 90.0
        - 180.0
        - 270.0
        enable: true
      random_color:
        brightness: 0.3
        contrast: 0.3
        saturation: 0.3
        hue: 0.3
        enable: true
      with_scale_random_crop:
        scale_range:
        - 1.0
        - 1.2
        enable: true
      with_random_blur: true
      with_random_crop: true
      mean:
      - 0.5
      - 0.5
      - 0.5
      std:
      - 0.5
      - 0.5
      - 0.5
    train_split: train
    validation_split: val
    test_split: test
    predict_split: test
    label_suffix: .png
    color_map: null
  classify:
    train_dataset:
      csv_path: /data/train_combined.csv
      images_dir: /data/train_data/256
    validation_dataset:
      csv_path: /data/valid_combined.csv
      images_dir: /data/test_data/256
    test_dataset:
      csv_path: /data/valid_combined.csv
      images_dir: /data/test_data/256
    infer_dataset:
      csv_path: /data/valid_combined.csv
      images_dir: /data/test_data/256
    image_ext: .jpg
    batch_size: 5
    workers: 4
    fpratio_sampling: 0.2
    num_input: 1
    input_map: null
    grid_map:
      x: 2
      'y': 2
    concat_type: linear
    output_shape:
    - 256
    - 256
    augmentation_config:
      rgb_input_mean:
      - 0.485
      - 0.456
      - 0.406
      rgb_input_std:
      - 0.229
      - 0.224
      - 0.225
    num_classes: 2
train:
  optim:
    monitor_name: val_loss
    optim: adamw
    lr: 5.0e-05
    policy: linear
    momentum: 0.9
    weight_decay: 0.01
  num_epochs: 30
  num_nodes: 1
  val_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: /results/pretrained/visual_changenet_classification_vvisual_changenet_nvpcb_trainable_v1.0/changenet_classifier.pth
  resume_training_checkpoint_path: null
  results_dir: ${results_dir}/train
  classify:
    loss: contrastive
    cls_weight:
    - 1.0
    - 10.0
  segment:
    loss: ce
    weights:
    - 0.5
    - 0.5
    - 0.5
    - 0.8
    - 1.0
  tensorboard:
    enabled: true
    infrequent_logging_frequency: 1
evaluate:
  num_gpus: 1
  checkpoint: ${results_dir}/train/changenet_classify.pth
  results_dir: null
  vis_after_n_batches: 16
  trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
  batch_size: ${dataset.classify.batch_size}
inference:
  num_gpus: 1
  checkpoint: ${results_dir}/train/changenet_classify.pth
  results_dir: null
  vis_after_n_batches: 1
  trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
  batch_size: ${dataset.classify.batch_size}
  gpu_id: 0
export:
  results_dir: null
  gpu_id: 0
  checkpoint: ${results_dir}/train/changenet_classify.pth
  onnx_file: ${results_dir}/export/changenet-classify.onnx
  on_cpu: false
  input_channel: 3
  input_width: 128
  input_height: 512
  opset_version: 12
  batch_size: ${dataset.classify.batch_size}
  verbose: false
gen_trt_engine:
  results_dir: null
  gpu_id: 0
  onnx_file: ${results_dir}/export/changenet-classify.onnx
  trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
  input_channel: 3
  input_width: 128
  input_height: 512
  opset_version: 12
  batch_size: ${dataset.classify.batch_size}
  verbose: false
  tensorrt:
    data_type: FP32
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10
    calibration:
      cal_image_dir: ???
      cal_cache_file: ???
      cal_batch_size: 1
      cal_batches: 1
encryption_key: '********'
results_dir: /results/train
num_gpus: 1
task: classify
:245: UserWarning: Log file already exists at /results/train/status.json
Number of output classes: 2
Total Parameters: 76331605
Trainable Parameters: 76331605
/usr/local/lib/python3.8/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has not been set for this class (AOIMetrics). The property determines if `update` by default needs access to the full metric state. If this is not the case, significant speedups can be achieved and we recommend setting this to `False`. We provide an checking function `from torchmetrics.utilities import check_forward_full_state_property` that can be used to check if the `full_state_update=True` (old and potential slower behaviour, default for now) or if `full_state_update=False` can be used safely.
  warnings.warn(*args, **kwargs)
Tensorboard logging enabled.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Setting resume checkpoint to None
Results directory /results/train Checkpoint Interval 1
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name          | Type              | Params
-----------------------------------------------
0 | model         | ChangeNetClassify | 76.3 M
1 | criterion     | ContrastiveLoss   | 0
2 | train_metrics | AOIMetrics        | 0
3 | val_metrics   | AOIMetrics        | 0
-----------------------------------------------
76.3 M    Trainable params
0         Non-trainable params
76.3 M    Total params
305.326   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Using 1 input types and linear type 1 X 1 for comparison
Number of steps for validation: 1930
Using 1 input types and linear type 1 X 1 for comparison
Sampling Defective components at 00.00:1 rate
Number of steps for training: 17305
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0: 0%| | 0/19235 [00:00<?, ?it/s]invalid multinomial distribution (sum of probabilities <= 0)
Error executing job with overrides: ['train.pretrained_model_path=/results/pretrained/visual_changenet_classification_vvisual_changenet_nvpcb_trainable_v1.0/changenet_classifier.pth']
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py", line 236, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py", line 218, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py", line 199, in run_experiment
    trainer.fit(model, ckpt_path=resume_ckpt or None)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 161, in on_run_start
    _ = iter(data_fetcher)  # creates the iterator inside the fetcher
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py", line 179, in __iter__
    self._apply_patch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py", line 120, in _apply_patch
    apply_to_collections(self.loaders, self.loader_iters, (Iterator, DataLoader), _apply_patch_fn)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py", line 156, in loader_iters
    return self.dataloader_iter.loader_iters
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 555, in loader_iters
    self._loader_iters = self.create_loader_iters(self.loaders)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 595, in create_loader_iters
    return apply_to_collection(loaders, Iterable, iter, wrong_dtype=(Sequence, Mapping))
  File "/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/apply_func.py", line 51, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 443, in __iter__
    return self._get_iterator()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 389, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1086, in __init__
    self._reset(loader, first_iter=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1119, in _reset
    self._try_put_index()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1353, in _try_put_index
    index = self._next_index()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 625, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 247, in __iter__
    batch = [next(sampler_iter) for _ in range(self.batch_size)]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 247, in <listcomp>
    batch = [next(sampler_iter) for _ in range(self.batch_size)]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 203, in __iter__
    rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%| | 0/19235 [00:01<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 376) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-15_02:49:39
host : a08277bdef16
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 376)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Execution status: FAIL
2023-11-15 02:49:47,285 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.
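One more data point: my spec sets fpratio_sampling: 0.2, yet the log prints "Sampling Defective components at 00.00:1 rate", so I suspect the per-sample weights built from /data/train_combined.csv end up all zero. This is the quick check I plan to run on that CSV; the column name "label" below is only my guess at the schema, not something confirmed from the docs:

# Hypothetical sanity check on my training CSV; the "label" column name is an
# assumption on my part and will be adjusted to whatever the real schema uses.
import pandas as pd

df = pd.read_csv("/data/train_combined.csv")
print(df.columns.tolist())               # inspect the actual column names first
if "label" in df.columns:
    print(df["label"].value_counts())    # are any rows marked as defective at all?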