I am trying to train on my custom dataset; however, I get the following error:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
Using TensorFlow backend.
2020-09-07 11:28:36.149216: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
The library attempted to open the following supporting CUDA libraries,
but each of them failed. CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.dylib: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca mpi_cuda_support 0 to suppress this message. If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
2020-09-07 11:28:39.294547: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-07 11:28:39.294594: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-09-07 11:28:39.294631: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ccb148e8dabd): /proc/driver/nvidia/version does not exist
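The three lines above say the NVIDIA kernel driver is not visible from inside the container, so TensorFlow falls back to CPU-only execution. As a quick check (my own diagnostic snippet, not part of TLT), you can reproduce what TensorFlow's dso_loader is doing by trying to dlopen the driver library yourself:

```python
import ctypes

# Try to dlopen the NVIDIA driver library, just as TensorFlow's
# dso_loader does at startup. On a host (or container) where the
# driver is not installed or not mounted, this raises OSError and
# TensorFlow silently falls back to CPU execution.
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded: the driver is visible")
except OSError as err:
    print(f"libcuda.so.1 not found ({err}); "
          "check the host driver and that the container was started "
          "with GPU access (e.g. 'docker run --gpus all')")
```

If this prints "not found" inside the container but the driver works on the host, the container was most likely started without GPU passthrough.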
2020-09-07 11:28:39,295 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/examples/yolo/specs/yolo_train_resnet18_kitti.txt.
2020-09-07 11:28:39,297 [INFO] /usr/local/lib/python3.6/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/examples/yolo/specs/yolo_train_resnet18_kitti.txt
2020-09-07 11:28:39,322 [INFO] iva.yolo.scripts.train: Loading pretrained weights. This may take a while…
2020-09-07 11:28:39,563 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-07 11:28:39,563 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-07 11:28:39,563 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-07 11:28:39,563 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-07 11:28:39,564 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1176, number of sources: 1, batch size per gpu: 16, steps: 74
2020-09-07 11:28:39,717 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-07 11:28:40,035 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-09-07 11:28:40,043 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-07 11:28:40,043 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-07 11:28:40.408476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Weights for those layers can not be loaded: ['expand_conv1', 'expand_conv1_bn', 'expand_conv1_lrelu']
STOP training now and check the pre-train model if this is not expected!
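This warning means the pretrained file contains no weights for layers named `expand_conv1*`, which suggests the checkpoint may come from a different backbone than the `resnet18` in my spec. Keras-style by-name loading simply skips layers whose names are absent from the checkpoint; a toy illustration of that matching (my own helper, not TLT code):

```python
def split_by_name(model_layers, checkpoint_layers):
    """Mimic load_weights(by_name=True): weights transfer only for
    layer names present in both the model and the checkpoint; the
    rest keep their random initialization."""
    ckpt = set(checkpoint_layers)
    loaded = [name for name in model_layers if name in ckpt]
    skipped = [name for name in model_layers if name not in ckpt]
    return loaded, skipped

# The expand_conv1* layers exist in the model but not in the checkpoint,
# so they are reported as not loadable, matching the warning above.
loaded, skipped = split_by_name(
    ["conv1", "expand_conv1", "expand_conv1_bn", "expand_conv1_lrelu"],
    ["conv1"],
)
print(skipped)  # the layers that keep their random init
```

If the skipped layers are only the new detection head, the warning is expected; if backbone layers are skipped, the pretrained model probably does not match the spec.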
2020-09-07 11:29:57,892 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-07 11:29:57,892 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-07 11:29:57,892 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-07 11:29:57,892 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-07 11:29:57,892 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 191, number of sources: 1, batch size per gpu: 16, steps: 12
2020-09-07 11:29:57,928 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-07 11:29:58,264 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-09-07 11:29:58,271 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-07 11:29:58,272 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-07 11:30:03,938 [INFO] iva.yolo.scripts.train: Number of images in the training dataset: 1176
Epoch 1/80
2020-09-07 11:30:26.010654: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node training_1/Adam/gradients/conv_big_object/convolution_grad/Conv2DBackpropInput}}]]
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 51, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 239, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 183, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2671, in _call
    session)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2623, in _make_callable
    callable_fn = session._make_callable_from_options(callable_opts)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1505, in _make_callable_from_options
    return BaseSession._Callable(self, callable_options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1460, in __init__
    session._session, options_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node training_1/Adam/gradients/conv_big_object/convolution_grad/Conv2DBackpropInput}}]]
Hardware: Intel i7 CPU.
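The InvalidArgumentError is the actual failure: because the GPU driver is not visible, the job runs on the CPU, and TensorFlow's CPU Conv2D backprop kernel only accepts channels-last (NHWC) tensors, whereas TLT builds its graphs channels-first (NCHW) for GPU execution. A pure-Python sketch of the two layouts (illustration only, not TLT code):

```python
def nchw_to_nhwc(t):
    """Transpose a nested-list tensor from NCHW (channels-first, the
    GPU-friendly layout TLT uses) to NHWC (channels-last, the only
    layout the CPU Conv2D backprop kernel accepts)."""
    n, c = len(t), len(t[0])
    h, w = len(t[0][0]), len(t[0][0][0])
    return [[[[t[i][k][j][l] for k in range(c)]  # channels become innermost
              for l in range(w)]
             for j in range(h)]
            for i in range(n)]

# One 2x2 image with a single channel:
nchw = [[[[1, 2],
          [3, 4]]]]               # shape (1, 1, 2, 2)
print(nchw_to_nhwc(nchw))        # shape (1, 2, 2, 1)
```

In other words, the data itself is fine; only the memory layout differs. Since TLT's graphs are built NCHW, the likely fix is to make the GPU visible to the container rather than to change the layout.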