Build Docker error standard_init_linux.go:219

I am trying to build this Docker image on a Jetson Xavier NX, following the build steps in the linked Dockerfile:
https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile

I get the following error during docker build:

`
Step 3/21 : RUN apt-get update && apt-get install -y python3-opencv ca-certificates python3-dev git wget sudo ninja-build
---> [Warning] The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

---> Running in c1fa8affec20
standard_init_linux.go:219: exec user process caused: exec format error

The command '/bin/sh -c apt-get update && apt-get install -y python3-opencv ca-certificates python3-dev git wget sudo ninja-build' returned a non-zero code: 1
`
Maybe the problem comes from the apt-get update and install command in the Dockerfile, but honestly I don’t know why.
I also often get the “standard_init_linux.go:219:…” error when I build Docker images on the Jetson Xavier NX, and I don’t know what causes it or how to fix it.

Thank you very much.
Nguyen Ngoc Dat

Hi @nguyenngocdat1995, I believe the issue is that this Dockerfile is using a base container for x86, not aarch64:

FROM nvidia/cuda:10.1-cudnn7-devel

nvidia/cuda:10.1-cudnn7-devel is an x86_64 container, not aarch64. So you need to change this line to use one of the L4T containers instead. I recommend l4t-pytorch or l4t-ml since it appears that this detectron2 build needs PyTorch.

You should pick a base container that matches your L4T version (which you can find with `cat /etc/nv_tegra_release`). For example, if you are on R32.5.0 or R32.5.1, you can use nvcr.io/nvidia/l4t-pytorch:r32.5.0-pth1.7-py3:

FROM nvcr.io/nvidia/l4t-pytorch:r32.5.0-pth1.7-py3
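To make the version matching concrete, here is a minimal sketch that turns the first line of /etc/nv_tegra_release into the `rXX.Y.Z` prefix used in the l4t-pytorch tags. The function name `l4t_version` is hypothetical, and the regular expression assumes the line has the `# R32 (release), REVISION: 4.4, ...` shape; the PyTorch part of the tag (for example `-pth1.7-py3`) still has to be looked up in the container catalog.

```python
import re


def l4t_version(release_line: str) -> str:
    """Convert the first line of /etc/nv_tegra_release into a tag prefix like 'r32.4.4'."""
    # Assumed format: "# R32 (release), REVISION: 4.4, GCID: ..., BOARD: ..."
    match = re.search(r"R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)", release_line)
    if match is None:
        raise ValueError(f"unrecognised release line: {release_line!r}")
    major, revision = match.groups()
    return f"r{major}.{revision}"


if __name__ == "__main__":
    # Sample line in the format that file uses.
    sample = (
        "# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, "
        "EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020"
    )
    print(l4t_version(sample))  # prints r32.4.4
```

On the device itself you could pass the first line of the file instead of the sample string.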

Also, I see another line in that Dockerfile that installs PyTorch from pip. However, the PyTorch wheels for Jetson aren’t distributed on PyPI, and the l4t-pytorch/l4t-ml base containers already include PyTorch and torchvision. So you will want to comment this out:

RUN pip install --user torch==1.8 torchvision==0.9 -f https://download.pytorch.org/whl/cu101/torch_stable.html

Note that I haven’t installed the detectron2 package before, so I may be of limited help if you encounter further errors there.

Thank you.
I followed your recommendation:

cat /etc/nv_tegra_release

# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020

So I followed your reference link and set the base image to:

FROM nvcr.io/nvidia/l4t-pytorch:r32.4.4-pth1.6-py3

I also skipped the command that installs torch again.
I think that got past the previous error.

But at a later command:

RUN pip install --user -e detectron2_repo

I got this error:

  File "/home/appuser/detectron2_repo/setup.py", line 10, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 188, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 141, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
----------------------------------------
WARNING: Discarding file:///home/appuser/detectron2_repo. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Then I followed this link: https://github.com/NVIDIA-AI-IOT/torch2trt/issues/483

but I got this issue:
Step 17/19 : RUN pip install --user -e detectron2_repo
---> Running in 31acb18c59d9
OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/31acb18c59d9b4efa7f2db47b3287426e8671cc21dfda8a49880ee67ffaa5493/log.json: no such file or directory): fork/exec usr/bin/nvidia-container-runtime: no such file or directory: unknown

Can you give me some suggestions for completing the docker build?
Thanks a lot
Nguyen Ngoc Dat

Thank you, the build completed successfully.
Nguyen Ngoc Dat

Try setting your default docker-runtime to nvidia and restart - https://github.com/dusty-nv/jetson-containers#docker-default-runtime
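For reference, the linked README sets the default runtime through /etc/docker/daemon.json. A minimal sketch of that change is below; the helper name `with_nvidia_default_runtime` is hypothetical, and it only returns the updated JSON text rather than writing the file, so you can review the result before saving it with sudo and restarting the Docker service.

```python
import json


def with_nvidia_default_runtime(daemon_json_text: str) -> str:
    """Return daemon.json content with the nvidia runtime registered and made the default."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    # Keep an existing nvidia entry if one is already present.
    runtimes.setdefault(
        "nvidia", {"path": "nvidia-container-runtime", "runtimeArgs": []}
    )
    config["default-runtime"] = "nvidia"
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    # Demo on an empty configuration; on the Jetson you would pass the
    # current contents of /etc/docker/daemon.json instead.
    print(with_nvidia_default_runtime(""))
```

This matters for docker build in particular, because the build has no --runtime option and so relies on the default runtime.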

Ah ok, great to hear that you got it built. Thanks.

Hello, it’s me again.
When I run the training code, I hit two problems:

  1. When I start the training script at

/home/appuser/detectron2_repo/projects/TensorMask/train_net.py

I get an error that CUDA is not available, raised by the statement assert torch.cuda.is_available().

I also ran a standalone python3 snippet (inside this Docker container) to check CUDA:

import torch
print(torch.cuda.is_available())

The result is False.
Maybe the problem comes from these lines in the Dockerfile:

ENV FORCE_CUDA="1"
ARG TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"

Do you have any suggestions?

  2. The second problem may not be an error, but I would like to ask how to resolve it. The message is:

Failed to load OpenCL runtime

  3. My last question: the training script has an argument that sets the number of GPUs. How can I check the number of GPUs on the Jetson Xavier NX?

Thank you very much.
Nguyen Ngoc Dat

Here is my Dockerfile:

# FROM nvidia/cuda-arm64:11.1.1-cudnn8-devel
# FROM nvidia/cuda:10.1-cudnn7-devel

FROM nvcr.io/nvidia/l4t-pytorch:r32.4.4-pth1.6-py3

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
	python3-opencv ca-certificates python3-dev git wget sudo ninja-build nano
# RUN ln -sv /usr/bin/python3 /usr/bin/python

# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system  --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser

ENV PATH="/home/appuser/.local/bin:${PATH}"
RUN wget https://bootstrap.pypa.io/get-pip.py && \
	python3 get-pip.py --user && \
	rm get-pip.py

# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
RUN pip install --user tensorboard cmake   # cmake from apt-get is too old
# RUN pip install --user torch==1.8 torchvision==0.9.1 -f https://download.pytorch.org/whl/cu101/torch_stable.html

RUN pip install --user 'git+https://github.com/facebookresearch/fvcore'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 detectron2_repo
# set FORCE_CUDA because during `docker build` cuda is not accessible
ENV FORCE_CUDA="1"
# This will by default build detectron2 for all common cuda architectures and take a lot more time,
# because inside `docker build`, there is no way to tell which architecture will be used.
ARG TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"

RUN pip install --user -e detectron2_repo
# RUN pip install -e detectron2_repo

# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
WORKDIR /home/appuser/detectron2_repo
#ADD /home/robot/program/data/dataset_wgisd /home/appuser/detectron2_repo/dataset_wgisd
ADD add_docker_ws /home/appuser/detectron2_repo/add_docker_ws
WORKDIR /home/appuser/detectron2_repo

# run detectron2 under user "appuser":
# wget http://images.cocodataset.org/val2017/000000439715.jpg -O input.jpg
# python3 demo/demo.py  \
	#--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml \
	#--input input.jpg --output outputs/ \
	#--opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl

I also tried without Docker and got the same error:
Distributed package doesn't have NCCL built in

I hope you can help me. Sorry for asking so much :)

More details of the error:

Command Line Args: Namespace(config_file='configs/tensormask_R_50_FPN_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=, resume=False)
Process group URL: tcp://127.0.0.1:50152
Process group URL: tcp://127.0.0.1:50152
Process group URL: tcp://127.0.0.1:50152
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
  File "/home/robot/program/detectron2/projects/TensorMask/train.py", line 150, in <module>
    args=(args,),
  File "/home/robot/program/detectron2/detectron2/engine/launch.py", line 79, in launch
    daemon=False,
  File "/home/robot/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/robot/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/robot/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/home/robot/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/robot/program/detectron2/detectron2/engine/launch.py", line 108, in _distributed_worker
    raise e
  File "/home/robot/program/detectron2/detectron2/engine/launch.py", line 103, in _distributed_worker
    timeout=timeout,
  File "/home/robot/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home/robot/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 597, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

Hi @nguyenngocdat1995, sorry for the delay. Jetson doesn’t have NCCL, as this library is intended for multi-GPU and multi-node servers. You may need to disable multiprocessing in detectron2’s training.
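One way this might be done, as a rough sketch: the log above shows the script being started with num_gpus=8, and as far as I can tell detectron2's launch() only spawns distributed workers (and so only touches NCCL) when more than one process is requested. Capping the requested count at the number of GPUs PyTorch can see should keep training in a single process on a Jetson. The helper name `pick_num_gpus` is hypothetical and not part of detectron2.

```python
def pick_num_gpus(requested: int, visible: int) -> int:
    """Cap the requested worker count at the number of visible GPUs (minimum 1)."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return max(1, min(requested, visible))


if __name__ == "__main__":
    try:
        import torch
        visible = torch.cuda.device_count()
    except (ImportError, OSError):
        # torch missing, or its CUDA libraries could not be loaded
        visible = 0
    # The failing run requested 8 GPUs.
    print(pick_num_gpus(8, visible))
```

In practice this amounts to passing --num-gpus 1 to the training script.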

How are you starting the container? Are you running it with the --runtime nvidia flag?

If you test running the base container just like this, is PyTorch able to detect the GPU?

sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-pytorch:r32.4.4-pth1.6-py3
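Once inside the container, a short check along these lines can show what PyTorch sees. This is only a sketch; the function name `cuda_report` is made up for illustration, and it also catches the case where importing torch itself fails because a CUDA library cannot be loaded.

```python
def cuda_report() -> dict:
    """Collect basic facts about PyTorch's view of the GPU."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        # OSError covers a missing shared library such as libcurand.so.10
        return {"torch_installed": False, "import_error": str(exc)}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for a CPU-only build
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

The gpu_count entry also answers the question about how many GPUs the training script can use.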