Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed...)

Command:

 tlt ssd train -e ssd-spec.prototxt -r /output -k DroneCrowd

Full log:

2021-06-11 11:28:23,766 [INFO] root: Registry: ['nvcr.io']
2021-06-11 11:28:23,821 [INFO] root: No mount points were found in the /home/lwschlds/.tlt_mounts.json file.
2021-06-11 11:28:23,821 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 315, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 224, in set_gpu_info_single_node
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 192, in check_valid_gpus
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-smi': 'nvidia-smi'
2021-06-11 11:28:28,848 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What is the ssd-spec.prototxt? Can you share it?

It's the experiment spec file:

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
  aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  enable_qat: false
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 5e-5
    max_learning_rate: 2e-2
    soft_start: 0.15
    annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 300
  output_height: 300
  output_channel: 3
}
dataset_config {
  data_sources: {
    label_directory_path: "../Crowd-Detection/train-KITTI-annotations"
    image_directory_path: "../Crowd-Detection/train-images"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  validation_data_sources: {
    label_directory_path: "../Crowd-Detection/val-KITTI-annotations"
    image_directory_path: "../Crowd-Detection/val-images"
  }
}

Can you run the following command and paste the result?
tlt ssd run nvidia-smi

2021-06-11 11:59:19,608 [INFO] root: Registry: ['nvcr.io']
2021-06-11 11:59:19,666 [INFO] root: No mount points were found in the /home/lwschlds/.tlt_mounts.json file.
2021-06-11 11:59:19,666 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown
2021-06-11 11:59:20,928 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

How about
tlt ssd run /usr/lib/wsl/lib/nvidia-smi

How about running the command below, mentioned in Installation Guide — NVIDIA Cloud Native Technologies documentation?
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.

Can I add it to $PATH?

Following the hint in centos container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown. · Issue #1382 · NVIDIA/nvidia-docker · GitHub, can you run
sudo docker run --rm --gpus all nvidia/cuda:9.0-base nvidia-smi

It returns the same error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.

Thanks for continuing to help

Please also ask your question in Issues · NVIDIA/nvidia-docker · GitHub.
I found a similar topic: centos container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown. · Issue #1382 · NVIDIA/nvidia-docker · GitHub.
But I suggest you create a new topic there and describe your environment clearly, i.e., how you installed WSL2, what command you ran, and what error you got.

I fixed it by running

$ sudo cp /usr/lib/wsl/lib/nvidia-smi /usr/bin/nvidia-smi
$ sudo chmod ugo+x /usr/bin/nvidia-smi

Hopefully the last thing: how do I specify the spec file in the tlt train command?
When I cd into the relevant directory containing the txt file and run the command, it returns:

FileNotFoundError: [Errno 2] No such file or directory: 'SPEC.txt'


The path given to tlt ssd train should be the path inside the docker container. Please refer to TLT Launcher — Transfer Learning Toolkit 3.0 documentation.

You can run something like below.
tlt ssd train -e path_to_ssd-spec.txt_inside_the_docker -r path_to_output_inside_the_docker -k DroneCrowd

Do you mean the section involving ~/.tlt_mounts.json?

If so how do I find the launcher config file?
Secondly, how do I find the path inside of the docker?

Thanks

The ~/.tlt_mounts.json file is generated by you. The drives/mount points need to be mapped into the docker container.
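As a sketch only (the paths and UID/GID here are placeholders, not taken from your setup), a minimal ~/.tlt_mounts.json could look like this; it also shows the optional DockerOptions "user" entry that the earlier root-permissions warning refers to:

```json
{
    "Mounts": [
        {
            "source": "/home/<user>/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```

Replace "1000:1000" with the actual values returned by `id -u` and `id -g` on your host.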

Or, you can also run in interactive mode.
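A sketch of the interactive approach (assuming the TLT 3.0 launcher; the in-container path is an example that depends on your own mount mapping):

```shell
# Open a shell inside the ssd task container; the mounts from
# ~/.tlt_mounts.json are applied, so you can check that your spec
# file is visible at the expected in-container path.
tlt ssd run /bin/bash

# Then, inside the container, list the mapped directory:
ls /workspace/tlt-experiments
```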

Could you please give an example of how the source directories in the file should be formatted? It is recognising the ~/.tlt_mounts.json file now but still cannot see the spec file.

Also does it matter where I am running the tlt system?

There is an example for the ~/.tlt_mounts.json file in TLT Launcher — Transfer Learning Toolkit 3.0 documentation. Below is part of it.
"source": "/path/to/your/data",
"destination": "/workspace/tlt-experiments/data"

For example, if your training spec file is located at /home/LwsChlds/spec.txt and the mount is set up accordingly as above, then the path in the docker for the spec file would be /workspace/tlt-experiments/spec.txt

Then, you can run
tlt ssd train -e /workspace/tlt-experiments/spec.txt -r /workspace/tlt-experiments/output -k DroneCrowd

Thanks, I am making progress: it can now find the files. However, it is now looking for a "weights" file or directory in my output folder.

Is this something I should have created, or is there a way to let it create the files itself?

Also are these meant to be one row down from the corresponding value?

Can you share the full log?