PyTorch Install with Python3 Broken

Just comment out line 9 of tools/setup_helpers/nccl.py and add USE_NCCL = False on the line after it; that should be enough.
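The edit described above can also be scripted. The following is a minimal sketch, not part of the PyTorch tree: the helper name disable_nccl is hypothetical, and line 9 is taken from the post, so it may point at a different statement in other PyTorch revisions. It assumes the target line is a single-line statement.

```python
def disable_nccl(source: str, line_number: int = 9) -> str:
    """Comment out one line and add USE_NCCL = False right after it.

    line_number is 1-based. The default of 9 comes from the forum post and
    is an assumption for any other revision of nccl.py.
    """
    lines = source.splitlines()
    index = line_number - 1
    if not 0 <= index < len(lines):
        raise ValueError(f"file has no line {line_number}")

    original = lines[index]
    # Keep the original indentation so the new assignment stays in the same block.
    indent = original[: len(original) - len(original.lstrip())]
    lines[index] = f"{indent}# {original.lstrip()}"
    lines.insert(index + 1, f"{indent}USE_NCCL = False")
    return "\n".join(lines) + "\n"


# Possible usage from the root of the PyTorch checkout (back up the file first):
#
#   from pathlib import Path
#   path = Path("tools/setup_helpers/nccl.py")
#   path.write_text(disable_nccl(path.read_text()))
```

Check the line before patching it: if line 9 is not the USE_NCCL assignment in your checkout, pass the right line_number instead.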

Hi brenozanchetta,

I followed your pytorch_install.sh (using JetPack 4.2).

In the repo, line 24 is:
wget https://rpmfind.net/linux/mageia/distrib/cauldron/aarch64/media/core/release/ninja-1.8.2-3.mga7.aarch64.rpm

The file ninja-1.8.2-3.mga7.aarch64.rpm is missing.

I tried ninja-1.9.0-2.mga7.aarch64.rpm instead, but the build got stuck at this step:

-- Build files have been written to: /home/venv/pytorch/build
[6/2849] Building CXX object third_party/protobuf/cmake/CMakeFi...bprotobuf.dir/__/src/google/protobuf/implicit_weak_message.cc.o
[17/2849] Building CXX object third_party/protobuf/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/message_lite.cc.o

nvlink error   : entry function '_Z33ncclAllReduceTreeLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
nvlink error   : entry function '_Z33ncclAllReduceRingLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
nvlink error   : entry function '_Z32ncclAllReduceTreeLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
nvlink error   : entry function '_Z32ncclAllReduceRingLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
nvlink error   : entry function '_Z32ncclAllReduceTreeLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
nvlink error   : entry function '_Z32ncclAllReduceRingLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_ZN14ncclPrimitivesILi4ELi2ELi2EdLi1ELi1E7FuncMaxIdEE9GenericOpILi0ELi0ELi1ELi0ELi1ELi1EEEvPKdPdii' with regcount of 96
Makefile:68: recipe for target '/home/venv/pytorch/build/nccl/obj/collectives/device/devlink.o' failed
make[2]: *** [/home/venv/pytorch/build/nccl/obj/collectives/device/devlink.o] Error 255
make[2]: Leaving directory '/home/venv/pytorch/third_party/nccl/nccl/src/collectives/device'
Makefile:44: recipe for target '/home/venv/pytorch/build/nccl/obj/collectives/device/colldevice.a' failed
make[1]: *** [/home/venv/pytorch/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/home/venv/pytorch/third_party/nccl/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 757, in <module>
    build_deps()
  File "setup.py", line 317, in build_deps
    build_dir='build')
  File "/home/venv/pytorch/tools/build_pytorch_libs.py", line 88, in build_caffe2
    check_call(build_cmd, cwd=build_dir, env=my_env)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Debug', '--', '-j', '6']' returned non-zero exit status 1.

Thanks! Your method solved my problem. It seems this code is responsible for handling USE_NCCL incorrectly.
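To tell whether a failed build log shows the same NCCL problem, a small scan of the log can help. This is only a sketch based on the log quoted in this thread: the function name nccl_failure_suspected and the two patterns are assumptions, not anything provided by PyTorch or NCCL.

```python
import re

# Patterns taken from the log in this thread; other versions may print differently.
NVLINK_ERROR = re.compile(r"^nvlink error\s*:")
NCCL_PATH = re.compile(r"(third_party|build)/nccl/")


def nccl_failure_suspected(log_text: str) -> bool:
    """Return True when the log has nvlink errors and a failed NCCL target."""
    has_nvlink_error = False
    has_nccl_failure = False
    for line in log_text.splitlines():
        if NVLINK_ERROR.match(line.strip()):
            has_nvlink_error = True
        if "failed" in line and NCCL_PATH.search(line):
            has_nccl_failure = True
    return has_nvlink_error and has_nccl_failure
```

If it returns True for your log, the USE_NCCL = False workaround is the first thing to try.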

varun365, my list of commands is a collection of many posts from many forums.
Ninja appeared in at least three good tutorials, but the basic installation through pip gave users too much trouble on JetPack 3.1.
I never tried it on JetPack 4.2, though, so a simple pip installation may well be fine for you.
My memory is a little foggy, but if that does not work, one of the following may help:

  1. Try installing with alien instead of dpkg, or vice versa.

  2. Search elsewhere for the version that used to be at the link I provided, e.g. RPM resource ninja(aarch-64).

  3. See the posts by dustin-nv from NVIDIA; he has many good tips about this.

Hope this helps!
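Before trying alien or another RPM, it may be worth checking whether a ninja executable is already on the PATH, for example after a pip installation. A minimal sketch follows; the helper name ninja_version is hypothetical, and it only assumes that the executable accepts --version.

```python
import shutil
import subprocess
from typing import Optional


def ninja_version(executable: str = "ninja") -> Optional[str]:
    """Return the version string printed by ninja, or None if it is unusable."""
    path = shutil.which(executable)
    if path is None:
        return None  # not installed, or not on PATH
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on the Python 3.6 used in this thread
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


# Possible usage:
#
#   version = ninja_version()
#   print("ninja not found" if version is None else "ninja " + version)
```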

Hello…

I got the error message below when installing PyTorch:

[ 41%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/operators/reduce_ops.cc.o
[ 41%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/operators/piecewise_linear_transform_op.cc.o
[ 41%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/operators/listwise_l2r_op.cc.o
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
caffe2/CMakeFiles/torch_cpu.dir/build.make:7978: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/contrib/aten/aten_op.cc.o' failed
make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/contrib/aten/aten_op.cc.o] Error 4
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:6447: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed
make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
  File "setup.py", line 755, in
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/home/nvidia/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/home/nvidia/pytorch/tools/setup_helpers/cmake.py", line 339, in build
    self.run(build_args, my_env)
  File "/home/nvidia/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '6']' returned non-zero exit status

Does anyone have any idea?

Thanks.

Hi ela_new2020,

Please refer to https://elinux.org/Jetson_Zoo.
If you still run into the issue, please open a new topic for it.

Thanks
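A note on the "internal compiler error: Killed (program cc1plus)" message: the log does not state the cause, but a compiler being killed in the middle of a -j 6 build is commonly a sign that the board ran out of memory. Treat that as an assumption. Under it, one thing to try is fewer parallel jobs. The sketch below picks a job count from the contents of /proc/meminfo; the function names and the 2 GiB per job budget are hypothetical, and it assumes this revision of setup.py honours the MAX_JOBS environment variable.

```python
import re


def mem_available_kib(meminfo_text: str) -> int:
    """Extract the MemAvailable value (in KiB) from /proc/meminfo text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1))


def suggest_jobs(meminfo_text: str, cpu_count: int, gib_per_job: float = 2.0) -> int:
    """Pick a job count limited by both CPU count and available memory.

    gib_per_job is a rough guess, not a measured figure; tune it for your build.
    """
    available_gib = mem_available_kib(meminfo_text) / (1024 * 1024)
    by_memory = int(available_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


# Possible usage on the board (assumes setup.py reads MAX_JOBS):
#
#   import os
#   with open("/proc/meminfo") as f:
#       jobs = suggest_jobs(f.read(), os.cpu_count() or 1)
#   print("try: MAX_JOBS=%d python3 setup.py install" % jobs)
```

Adding swap space is another commonly suggested remedy if the build is still killed with a single job.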