Bluefield not reachable from the host after installing CUDA on the Bluefield

brandt33 · February 7, 2023, 2:14am

I installed CUDA on the Bluefield. At the last stage of the installation, a UEFI screen opened and asked for a password that would be requested after reboot. However, after the reboot I am no longer able to ssh into the Bluefield (and so have not seen any UEFI screen asking for that password).

How can I reach the Bluefield from the Host again?

The sequence I used was:
Install DOCA
Install CUDA (as per instructions at the end of the DOCA instructions)

sudo mst start
sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i internal_cpu_model
sudo ip addr add 192.168.100.1/24 dev tmfifo_net0
ping 192.168.100.2
ssh ubuntu@192.168.100.2


   25  exit
   26  sudo mst start
   27  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i internal_cpu_model
   28  sudo ip addr add 192.168.100.1/24 dev tmfifo_net0
   29  ping 192.168.100.2
   30  ip a
   31  ping 192.168.100.2
   32  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q
   33  ip a
   34  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i uefi
   35  history
admin@localhost:~> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e4:3d:1a:80:e3:50 brd ff:ff:ff:ff:ff:ff
    altname eno12399np0
    altname enp49s0f0np0
    inet 192.168.198.21/24 brd 192.168.198.255 scope global em3
       valid_lft forever preferred_lft forever
    inet6 fe80::e63d:1aff:fe80:e350/64 scope link
       valid_lft forever preferred_lft forever
3: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e4:3d:1a:80:e3:51 brd ff:ff:ff:ff:ff:ff
    altname eno12409np1
    altname enp49s0f1np1
4: em1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b0:7b:25:d4:51:62 brd ff:ff:ff:ff:ff:ff
    altname eno8303
    altname enp4s0f0
5: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b0:7b:25:d4:51:63 brd ff:ff:ff:ff:ff:ff
    altname eno8403
    altname enp4s0f1
6: tmfifo_net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 00:1a:ca:ff:ff:02 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 scope global tmfifo_net0
       valid_lft forever preferred_lft forever
    inet6 fe80::21a:caff:feff:ff02/64 scope link
       valid_lft forever preferred_lft forever
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8d:bb:01:ef brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
admin@localhost:~> ping ubuntu@192.168.100.2
ping: ubuntu@192.168.100.2: Name or service not known
admin@localhost:~> ping 192.168.100.2
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.
From 192.168.100.1 icmp_seq=1 Destination Host Unreachable
From 192.168.100.1 icmp_seq=2 Destination Host Unreachable
From 192.168.100.1 icmp_seq=3 Destination Host Unreachable
From 192.168.100.1 icmp_seq=4 Destination Host Unreachable
From 192.168.100.1 icmp_seq=5 Destination Host Unreachable
From 192.168.100.1 icmp_seq=6 Destination Host Unreachable
^C
--- 192.168.100.2 ping statistics ---
9 packets transmitted, 0 received, +6 errors, 100% packet loss, time 8182ms
pipe 4
admin@localhost:~>
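
Before retrying ping, it can help to confirm that the host side of the rshim path is in place. The following is a minimal sketch, assuming the rshim driver exposes its console at /dev/rshim0/console (the path mentioned later in this thread) and that the virtual interface is named tmfifo_net0 as in the output above. The function name check_rshim_path and the RSHIM_DEV and TMFIFO_IF variables are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether the host-side pieces of the
# rshim/tmfifo path to the BlueField are present.
RSHIM_DEV="${RSHIM_DEV:-/dev/rshim0}"    # assumed rshim device directory
TMFIFO_IF="${TMFIFO_IF:-tmfifo_net0}"    # interface name from the 'ip a' output

check_rshim_path() {
    # The console node is what a serial terminal program would attach to.
    if [ -e "$RSHIM_DEV/console" ]; then
        echo "rshim console present: $RSHIM_DEV/console"
    else
        echo "rshim console missing: $RSHIM_DEV/console"
    fi
    # Linux lists every network interface under /sys/class/net.
    if [ -e "/sys/class/net/$TMFIFO_IF" ]; then
        echo "interface present: $TMFIFO_IF"
    else
        echo "interface missing: $TMFIFO_IF"
    fi
}

check_rshim_path
```

If the interface is present and has 192.168.100.1/24 assigned but 192.168.100.2 still does not answer, the problem is more likely on the Arm side (for example, the OS not finishing boot) than on the host.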

brandt33 · February 7, 2023, 3:30am

I have tried everything I can think of (remotely). Is there any way to save this card?

The instructions were those from DOCA; the last instruction is the one that caused the problem with UEFI:

Download Installer for Linux Ubuntu 20.04 arm64-sbsa
The base installer is available for download below.

Base Installer
Installation Instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installers/cuda-repo-ubuntu2004-12-0-local_12.0.1-525.85.12-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-0-local_12.0.1-525.85.12-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

shim1 · February 7, 2023, 7:10am

Hi brandt33,

I suggest you re-burn a BFB to recover the card.

Or you can try minicom -o -D /dev/rshim0/console to log in to the Arm side of the BF card (if the CPU can still start up).

If you still have the issue after re-burning the BFB, you can contact networking-support@nvidia.com

Thank you
M,S
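
The re-burn suggested above can be sketched as follows, assuming the rshim driver is running on the host and exposes a boot device at /dev/rshim0/boot. Writing a BFB image to that device is the commonly documented way to push it from the host. The function name push_bfb and the image path in the usage line are hypothetical, and re-installing an image is expected to replace the OS on the Arm side, so treat this as a last resort.

```shell
#!/bin/sh
# Hypothetical helper: push_bfb <image.bfb> [rshim-device-dir]
# Copies a BFB image to the rshim boot device after basic sanity checks.
push_bfb() {
    bfb="$1"
    dev="${2:-/dev/rshim0}"    # assumed default rshim device directory
    if [ ! -f "$bfb" ]; then
        echo "BFB image not found: $bfb" >&2
        return 1
    fi
    if [ ! -e "$dev/boot" ]; then
        echo "rshim boot device not found: $dev/boot" >&2
        return 1
    fi
    # Writing the image to the boot device starts the install; needs root.
    cat "$bfb" > "$dev/boot"
}

# Example (as root, with a real image path of your own):
#   push_bfb /path/to/image.bfb /dev/rshim0
```

Progress of the install can be followed on /dev/rshim0/console while the image is being written.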

brandt33 · February 7, 2023, 4:00pm

Hi M,S,

Thank you for your note and suggestions.

The CPU is reachable and comes up normally, but there is no minicom. Is there an alternative to minicom that I could use? …

Thank you,

Brandt
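
One possible alternative, assuming GNU screen is installed on the host, is to attach it to the same rshim console device named in the reply above. The sketch below picks whichever of the two programs is available and prints the command to run. The function name pick_console_cmd is hypothetical, and the 115200 baud argument is an assumed conventional value rather than something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print a console command for the first terminal
# program found on the host. Takes an optional console device path.
pick_console_cmd() {
    dev="${1:-/dev/rshim0/console}"    # device path from the reply above
    if command -v minicom >/dev/null 2>&1; then
        echo "minicom -o -D $dev"
    elif command -v screen >/dev/null 2>&1; then
        echo "screen $dev 115200"
    else
        echo "no supported terminal program found (tried minicom, screen)" >&2
        return 1
    fi
}

# Example: show the command, then run it as root.
#   pick_console_cmd
```

If neither program is present, installing one of them through the host's package manager is the simplest route.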

system · February 21, 2023, 4:01pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
