Is the latest version TLT 3.0? I will install the latest Docker image and try it out. Thanks for your advice.
I suspect the error occurs because my first trained model has an output dimension of 43, and I passed that .tlt file, unmodified, as the pretrained model for a second training run that has only 38 classes.
Here is my full log:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
Using TensorFlow backend.
2021-11-23 08:12:22.021970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-11-23 08:12:24.390643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-11-23 08:12:24.413575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
2021-11-23 08:12:24.413618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-11-23 08:12:24.414815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-11-23 08:12:24.415855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-11-23 08:12:24.416171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-11-23 08:12:24.417494: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-11-23 08:12:24.418516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-11-23 08:12:24.421681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-11-23 08:12:24.422604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-11-23 08:12:24.422635: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-11-23 08:12:24.765776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-23 08:12:24.765819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-11-23 08:12:24.765825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-11-23 08:12:24.767037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9859 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2021-11-23 08:12:24,768 [INFO] /usr/local/lib/python3.6/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from /workspace/tlt_demo/ssd/tlt_specs/ssd_train_resnet18_kitti_buffet_1021.txt
2021-11-23 08:12:24,774 [INFO] iva.ssd.scripts.train: Loading pretrained weights. This may take a while...
2021-11-23 08:12:24,882 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-11-23 08:12:24,882 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-11-23 08:12:24,882 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-11-23 08:12:24,882 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 16, io threads: 32, compute threads: 16, buffered batches: 4
2021-11-23 08:12:24,883 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 303, number of sources: 1, batch size per gpu: 8, steps: 38
2021-11-23 08:12:24,987 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-11-23 08:12:25.015316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
2021-11-23 08:12:25.015357: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-11-23 08:12:25.015396: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-11-23 08:12:25.015410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-11-23 08:12:25.015422: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-11-23 08:12:25.015436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-11-23 08:12:25.015448: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-11-23 08:12:25.015462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-11-23 08:12:25.016147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-11-23 08:12:25,200 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2021-11-23 08:12:25,205 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-11-23 08:12:25,205 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
The shape of this layer does not match original model: conv1
Loading the model as a pruned model.
2021-11-23 08:17:55,231 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-11-23 08:17:55,231 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-11-23 08:17:55,231 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-11-23 08:17:55,231 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 16, io threads: 32, compute threads: 16, buffered batches: 4
2021-11-23 08:17:55,231 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 75, number of sources: 1, batch size per gpu: 8, steps: 10
2021-11-23 08:17:55,257 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-11-23 08:17:55,458 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2021-11-23 08:17:55,463 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-11-23 08:17:55,463 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 38 and 43 for 'loss/ssd_predictions_loss/mul' (op: 'Mul') with input shapes: [8,65382,38], [8,65382,43].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 45, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 248, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 133, in run_experiment
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 342, in compile
sample_weight, mask)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py", line 404, in weighted
score_array = fn(y_true, y_pred)
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/architecture/ssd_loss.py", line 113, in compute_loss
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/architecture/ssd_loss.py", line 71, in bce_focal_loss
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 899, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 1206, in _mul_dispatch
return gen_math_ops.mul(x, y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6701, in mul
"Mul", x=x, y=y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
control_input_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 38 and 43 for 'loss/ssd_predictions_loss/mul' (op: 'Mul') with input shapes: [8,65382,38], [8,65382,43].
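To make the final error easier to read, here is a minimal pure-Python sketch of the broadcasting rule that an elementwise multiply applies to its two input shapes. It is not TLT code: `broadcast_shape`, `shape_a`, and `shape_b` are hypothetical names I made up for illustration, and only the two shapes are copied from the traceback. I am not assuming which of the two shapes is the label tensor and which is the prediction tensor.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of a and b, or raise ValueError.

    Dimensions are compared from the last axis backwards. Two sizes are
    compatible only if they are equal or one of them is 1.
    """
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"Dimensions must be equal, but are {x} and {y}")
    return tuple(reversed(result))


# The two input shapes reported for 'loss/ssd_predictions_loss/mul'.
shape_a = (8, 65382, 38)
shape_b = (8, 65382, 43)

try:
    broadcast_shape(shape_a, shape_b)
    compatible = True
    message = ""
except ValueError as err:
    compatible = False
    message = str(err)

print(compatible, message)
```

The batch size (8) and the middle axis (65382) agree, so only the last axis differs. That is consistent with my guess that the two training runs disagree on the size of the per-anchor output, rather than on image size or batch size.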