Command:
tlt ssd train -e ssd-spec.prototxt -r /output -k DroneCrowd
Full log:
2021-06-11 11:28:23,766 [INFO] root: Registry: ['nvcr.io']
2021-06-11 11:28:23,821 [INFO] root: No mount points were found in the /home/lwschlds/.tlt_mounts.json file.
2021-06-11 11:28:23,821 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 315, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 224, in set_gpu_info_single_node
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 192, in check_valid_gpus
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-smi': 'nvidia-smi'
2021-06-11 11:28:28,848 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
What is the ssd-spec.prototxt? Can you share it?
It's the experiment spec file:
random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
  aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  enable_qat: false
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.15
      annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 300
  output_height: 300
  output_channel: 3
}
dataset_config {
  data_sources: {
    label_directory_path: "../Crowd-Detection/train-KITTI-annotations"
    image_directory_path: "../Crowd-Detection/train-images"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  validation_data_sources: {
    label_directory_path: "../Crowd-Detection/val-KITTI-annotations"
    image_directory_path: "../Crowd-Detection/val-images"
  }
}
Can you run the following command and paste the result?
tlt ssd run nvidia-smi
2021-06-11 11:59:19,608 [INFO] root: Registry: ['nvcr.io']
2021-06-11 11:59:19,666 [INFO] root: No mount points were found in the /home/lwschlds/.tlt_mounts.json file.
2021-06-11 11:59:19,666 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown
2021-06-11 11:59:20,928 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
How about
tlt ssd run /usr/lib/wsl/lib/nvidia-smi
How about running the command below, mentioned in the Installation Guide — NVIDIA Cloud Native Technologies documentation?
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.
Can I add it to $PATH?
From the hint of https://github.com/NVIDIA/nvidia-docker/issues/1382, can you run
sudo docker run --rm --gpus all nvidia/cuda:9.0-base nvidia-smi
It returns the same:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.
Thanks for continuing to help
Please ask your question in https://github.com/NVIDIA/nvidia-docker/issues/ too.
I see there is a similar topic, https://github.com/NVIDIA/nvidia-docker/issues/1382.
But I suggest you create a new topic and describe your environment clearly there, i.e., how you installed WSL2, what command you ran, and what error you got.
I fixed it by running
$ sudo cp /usr/lib/wsl/lib/nvidia-smi /usr/bin/nvidia-smi
$ sudo chmod ogu+x /usr/bin/nvidia-smi
Hopefully the last thing: how do I specify the spec file in the tlt train command?
When I have cd'd into the relevant directory containing the txt file and run the command, it returns:
FileNotFoundError: [Errno 2] No such file or directory: 'SPEC.txt'
The path after tlt ssd train should be the path inside the docker. Please refer to TLT Launcher — Transfer Learning Toolkit 3.0 documentation.
You can run something like below.
tlt ssd train -e path_to_ssd-spec.txt_inside_the_docker -r path_to_output_inside_the_docker -k DroneCrowd
Do you mean the section involving ~/.tlt_mounts.json?
If so, how do I find the launcher config file?
Secondly, how do I find the path inside the docker?
Thanks
The ~/.tlt_mounts.json is generated by you. The drives/mount points need to be mapped to the docker.
Or, you can also run in interactive mode.
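For interactive mode, something like the following should work (an assumption on my part: passing /bin/bash as the command to tlt ssd run, the same form used for nvidia-smi earlier in this thread):

```shell
# Open a shell inside the TLT container; the mounts from ~/.tlt_mounts.json
# are applied, so you can inspect the mapped paths directly before training.
tlt ssd run /bin/bash
```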
Could you please give an example of how the source directories in the files should be formatted? It is recognising the ~/.tlt_mounts.json file now but still cannot see the spec file.
Also, does it matter where I am running the tlt system from?
There is an example for the ~/.tlt_mounts.json, see TLT Launcher — Transfer Learning Toolkit 3.0 documentation. Below is part of it.
"source": "/path/to/your/data",
"destination": "/workspace/tlt-experiments/data"
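Putting it together, a minimal ~/.tlt_mounts.json might look like the following. The paths are placeholders to adapt to your setup, and the "user" entry under DockerOptions is the one suggested by the warning in the logs above ("1000:1000" stands in for the output of id -u and id -g):

```json
{
    "Mounts": [
        {
            "source": "/path/to/your/data",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/path/to/your/output",
            "destination": "/workspace/tlt-experiments/output"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```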
For example, if your training spec file is located at /home/LwsChlds/spec.txt and you map "source": "/home/LwsChlds" to "destination": "/workspace/tlt-experiments", then the path inside the docker for the spec file would be /workspace/tlt-experiments/spec.txt
Then, you can run
tlt ssd train -e /workspace/tlt-experiments/spec.txt -r /workspace/tlt-experiments/output -k DroneCrowd
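The path translation described above can be sketched in a few lines of Python. This is illustrative only, not the launcher's actual code; the mount values follow the /home/LwsChlds example above:

```python
# Sketch of how the launcher maps a host path to an in-container path,
# based on the "source"/"destination" pairs in ~/.tlt_mounts.json.
def to_container_path(host_path, mounts):
    for m in mounts:
        src = m["source"].rstrip("/")
        if host_path.startswith(src + "/"):
            # Replace the mounted source prefix with its destination.
            return m["destination"].rstrip("/") + host_path[len(src):]
    raise ValueError("path is not under any mounted source: " + host_path)

mounts = [{"source": "/home/LwsChlds",
           "destination": "/workspace/tlt-experiments"}]

print(to_container_path("/home/LwsChlds/spec.txt", mounts))
# → /workspace/tlt-experiments/spec.txt
```

Any path you pass to tlt ssd train must be reachable through one of these mappings; a path outside every "source" simply does not exist inside the container, which is why the FileNotFoundError above appeared.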
Thanks, I am making progress: it is now able to find the files. However, it is now looking for a "weights" file or directory in my output folder.
Is this something I should have made myself, or is there a way to let it create the files itself?
Also, are these meant to be one row down from the corresponding value?
Can you share the full log?