MLNX_OFED_LINUX-4.5 on Arm64 (Jetson TX2)

Hello. I want to use InfiniBand on a Jetson TX2 (running Ubuntu 16.04) to send data (6-8 Gbit/s) over an MCX413A-BCAT adapter.

I downloaded and installed the MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-aarch64 package successfully, but when I try to start the services, I get an error:

root@jetson:~# /etc/init.d/openibd start

Loading HCA driver and Access Layer: [FAILED]

In dmesg I see:

[ 3668.345368] mlx5_0:wait_for_async_commands:659:(pid 26302): done with all pending requests

[ 3668.669829] (0000:01:00.0): E-Switch: cleanup

[ 3685.813786] Compat-mlnx-ofed backport release: b4fdfac

[ 3685.820167] Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git b4fdfac

[ 3685.827897] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git

[ 3685.888572] mlx5_core 0000:01:00.0: firmware version: 12.24.1000

[ 3685.896043] mlx5_core 0000:01:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:01.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)

[ 3686.185440] (0000:01:00.0): E-Switch: Total vports 1, per vport: max uc(1024) max mc(16384)

[ 3686.199990] mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged

[ 3686.212889] mlx5_core 0000:01:00.0: FW Tracer Owner

[ 3686.217414] mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(1)

[ 3686.380081] mlx5_ib: Mellanox Connect-IB Infiniband driver v4.5-1.0.1

[ 3686.388350] mlx5_ib: Mellanox Connect-IB Infiniband driver v4.5-1.0.1

[ 3686.484087] user_mad: couldn’t register device number

[ 3686.791433] mlx5_core 0000:01:00.0 eth1: Link up

[ 3686.797921] 8021q: adding VLAN 0 to HW filter on device eth1

While researching the problem, I found that the ib_umad module cannot be loaded and fails with this error:

root@jetson:~# modprobe ib_umad

modprobe: ERROR: could not insert ‘ib_umad’: Device or resource busy

As a result, InfiniBand does not work:

root@jetson:~# sminfo

ibwarn: [1614] get_abi_version: can’t read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?

ibwarn: [1614] mad_rpc_open_port: can’t open UMAD port ((null):0)

sminfo: iberror: failed: Failed to open ‘(null)’ port ‘0’

root@jetson:~# ibp

ibping ibportstate ibprintca.pl ibprintrt.pl ibprintswitch.pl

root@jetson:~# ibping 10.10.5.1

ibwarn: [1615] get_abi_version: can’t read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?

ibwarn: [1615] mad_rpc_open_port: can’t open UMAD port ((null):0)

ibping: iberror: failed: Failed to open ‘(null)’ port ‘0’

root@jetson:~#

Hi Konstantin,

You can try a forced restart of the openibd service:

/etc/init.d/openibd force-restart

It is also worth checking what is holding the module by running lsmod.

Dikla
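
For the lsmod check suggested above, a small Python sketch (not part of the original reply) can pull the use count and the holders of one module out of lsmod output. The function name module_holders is hypothetical, and the sample lines are copied from the lsmod listing posted in this thread. It only reads the "Used by" column, so a module that never loaded simply comes back as None.

```python
def module_holders(lsmod_text, module):
    """Return (use_count, [holder modules]) for `module`, or None if it is not listed.

    Expects text in the `lsmod` layout: name, size, use count, optional
    comma-separated list of modules that depend on it.
    """
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Blank lines and short lines are skipped; the header never matches a module name.
        if len(fields) >= 3 and fields[0] == module:
            holders = fields[3].split(",") if len(fields) > 3 else []
            return int(fields[2]), holders
    return None


# Sample lines taken from the lsmod output in this thread.
SAMPLE = """\
Module Size Used by
ib_cm 43924 1 ib_ipoib
mlx5_ib 290724 0
ib_core 252316 5 ib_cm,mlx4_ib,mlx5_ib,ib_uverbs,ib_ipoib
"""

for name in ("ib_umad", "ib_core", "mlx5_ib"):
    print(name, module_holders(SAMPLE, name))
```

On the target machine the same function could be fed the real output of lsmod instead of SAMPLE.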

I tried restarting many times and always got the same error:

root@jetson:~# /etc/init.d/openibd force-restart

Unloading HCA driver: [ OK ]

Loading HCA driver and Access Layer: [FAILED]

Please run /usr/sbin/sysinfo-snapshot.py to collect the debug information

and open an issue in the http://support.mellanox.com/SupportWeb/service_center/SelfService

root@jetson:~# lsmod

Module Size Used by

ib_ipoib 147054 0

ib_cm 43924 1 ib_ipoib

mlx5_fpga_tools 9559 0

mlx5_ib 290724 0

ib_uverbs 107704 1 mlx5_ib

mlx5_core 767481 2 mlx5_ib,mlx5_fpga_tools

mlxfw 12529 1 mlx5_core

mlx4_ib 188185 0

ib_core 252316 5 ib_cm,mlx4_ib,mlx5_ib,ib_uverbs,ib_ipoib

mlx4_en 114364 0

mlx4_core 328955 2 mlx4_en,mlx4_ib

mlx_compat 16114 10 ib_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_core,ib_uverbs,mlx4_core,mlx5_core,ib_ipoib

bcmdhd 7442379 0

pci_tegra 60038 0

knem 31619 0

bluedroid_pm 11195 0

root@jetson:~# dmesg

[ 369.961352] mlx5_0:wait_for_async_commands:659:(pid 1767): done with all pending requests

[ 370.238243] (0000:01:00.0): E-Switch: cleanup

[ 376.563516] Compat-mlnx-ofed backport release: b4fdfac

[ 376.568730] Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git b4fdfac

[ 376.575416] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git

[ 376.637752] mlx5_core 0000:01:00.0: firmware version: 12.24.1000

[ 376.643891] mlx5_core 0000:01:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:01.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)

[ 376.930927] (0000:01:00.0): E-Switch: Total vports 1, per vport: max uc(1024) max mc(16384)

[ 376.942928] mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged

[ 376.953592] mlx5_core 0000:01:00.0: FW Tracer Owner

[ 376.966764] mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(1)

[ 377.129174] mlx5_ib: Mellanox Connect-IB Infiniband driver v4.5-1.0.1

[ 377.135762] mlx5_ib: Mellanox Connect-IB Infiniband driver v4.5-1.0.1

[ 377.237329] user_mad: couldn’t register device number

[ 377.496779] mlx5_core 0000:01:00.0 eth1: Link up

[ 377.503114] 8021q: adding VLAN 0 to HW filter on device eth1

root@jetson:~#

On an x64 PC I use the same version of OFED and openibd starts successfully. Does OFED require some specific kernel configuration options?

Hi Konstantin,

The main failure, “user_mad: couldn’t register device number”, occurs because on this specific system the minor number that the user_mad module uses for its character device is already taken by some other device.

Dikla
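
One way to look for the conflicting device described above is to list which drivers are registered on the InfiniBand character major in /proc/devices. The sketch below is only an illustration, not something from the original reply. It assumes the MLNX_OFED ib_umad module uses the same major as the mainline kernel (231, the number the Linux device list assigns to InfiniBand, with umad devices starting at minor 0). The names char_devices and IB_MAJOR, and the example_other_driver entry in the sample, are hypothetical. Note that /proc/devices shows majors and driver names only, not minor ranges, so this narrows down candidates rather than proving which device holds the minor.

```python
IB_MAJOR = 231  # assumption: InfiniBand char major from the mainline Linux device list


def char_devices(proc_devices_text):
    """Map major number -> list of driver names from the 'Character devices:' section."""
    result = {}
    in_char = False
    for raw in proc_devices_text.splitlines():
        line = raw.strip()
        if line == "Character devices:":
            in_char = True
        elif line == "Block devices:":
            in_char = False
        elif in_char and line:
            parts = line.split(None, 1)
            # Each entry is "<major> <name>"; ignore anything that does not fit.
            if len(parts) == 2 and parts[0].isdigit():
                result.setdefault(int(parts[0]), []).append(parts[1])
    return result


# Hypothetical sample in the /proc/devices layout; example_other_driver is made up.
SAMPLE = """\
Character devices:
  1 mem
231 example_other_driver
231 infiniband_verbs

Block devices:
  8 sd
"""

print(char_devices(SAMPLE).get(IB_MAJOR, []))
```

On the Jetson, passing the contents of /proc/devices (for example open("/proc/devices").read()) instead of SAMPLE would show whether anything other than the InfiniBand modules is registered on that major.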