Bluefield not reachable from the host after installing CUDA on the Bluefield

I installed CUDA on the Bluefield. At the last stage of installation, a UEFI screen opened and asked for a password that would be requested after reboot. However, after the reboot I am no longer able to ssh into the Bluefield (and so have not seen any UEFI screen asking for that password).

How can I reach the Bluefield from the Host again?

The sequence I used was:
Install DOCA
Install CUDA (as per instructions at the end of the DOCA instructions)

sudo mst start
sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i internal_cpu_model
sudo ip addr add 192.168.100.1/24 dev tmfifo_net0
ping 192.168.100.2
ssh ubuntu@192.168.100.2


   25  exit
   26  sudo mst start
   27  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i internal_cpu_model
   28  sudo ip addr add 192.168.100.1/24 dev tmfifo_net0
   29  ping 192.168.100.2
   30  ip a
   31  ping 192.168.100.2
   32  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q
   33  ip a
   34  sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i uefi
   35  history
admin@localhost:~> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e4:3d:1a:80:e3:50 brd ff:ff:ff:ff:ff:ff
    altname eno12399np0
    altname enp49s0f0np0
    inet 192.168.198.21/24 brd 192.168.198.255 scope global em3
       valid_lft forever preferred_lft forever
    inet6 fe80::e63d:1aff:fe80:e350/64 scope link
       valid_lft forever preferred_lft forever
3: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e4:3d:1a:80:e3:51 brd ff:ff:ff:ff:ff:ff
    altname eno12409np1
    altname enp49s0f1np1
4: em1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b0:7b:25:d4:51:62 brd ff:ff:ff:ff:ff:ff
    altname eno8303
    altname enp4s0f0
5: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b0:7b:25:d4:51:63 brd ff:ff:ff:ff:ff:ff
    altname eno8403
    altname enp4s0f1
6: tmfifo_net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 00:1a:ca:ff:ff:02 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 scope global tmfifo_net0
       valid_lft forever preferred_lft forever
    inet6 fe80::21a:caff:feff:ff02/64 scope link
       valid_lft forever preferred_lft forever
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8d:bb:01:ef brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
admin@localhost:~> ping ubuntu@192.168.100.2
ping: ubuntu@192.168.100.2: Name or service not known
admin@localhost:~> ping 192.168.100.2
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.
From 192.168.100.1 icmp_seq=1 Destination Host Unreachable
From 192.168.100.1 icmp_seq=2 Destination Host Unreachable
From 192.168.100.1 icmp_seq=3 Destination Host Unreachable
From 192.168.100.1 icmp_seq=4 Destination Host Unreachable
From 192.168.100.1 icmp_seq=5 Destination Host Unreachable
From 192.168.100.1 icmp_seq=6 Destination Host Unreachable
^C
--- 192.168.100.2 ping statistics ---
9 packets transmitted, 0 received, +6 errors, 100% packet loss, time 8182ms
pipe 4
admin@localhost:~>
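
The output above shows tmfifo_net0 up on the host with 192.168.100.1 but no answer from 192.168.100.2. A minimal host-side sketch for checking whether the rshim device nodes behind that interface are present is given below. It assumes the rshim driver exposes the card as /dev/rshim0 with boot, console and misc nodes; the function name check_rshim is hypothetical and only for illustration.

```shell
#!/bin/sh
# Host-side triage sketch (an assumption, not part of the original post).
# check_rshim [DIR]: report which rshim device nodes exist under DIR.
# DIR defaults to /dev/rshim0, the usual name for the first BlueField.
check_rshim() {
    rshim_dir="${1:-/dev/rshim0}"
    if [ ! -d "$rshim_dir" ]; then
        echo "missing: $rshim_dir (is the rshim service running on the host?)"
        return 1
    fi
    # These three nodes are the ones used for reflashing, console access
    # and boot status respectively.
    for node in boot console misc; do
        if [ -e "$rshim_dir/$node" ]; then
            echo "present: $rshim_dir/$node"
        else
            echo "absent: $rshim_dir/$node"
        fi
    done
    return 0
}

# Report the result without aborting the shell if the device is missing.
check_rshim /dev/rshim0 || true
```

If the nodes are present, the card is still reachable over rshim even though the Arm side does not answer on tmfifo_net0, so a console login or a reflash from the host remains possible.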

I have tried everything I can think of (remotely). Is there any way to save this card?

The instructions were those from DOCA; the last instruction is the one that caused the problem with UEFI:

Download Installer for Linux Ubuntu 20.04 arm64-sbsa
Installation Instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installers/cuda-repo-ubuntu2004-12-0-local_12.0.1-525.85.12-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-0-local_12.0.1-525.85.12-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Hi brandt33,

I suggest re-burning a BFB image to recover the card.

Alternatively, you can try minicom -o -D /dev/rshim0/console to log in to the Arm side of the BF card (if the CPU can still start up).

If you still have the issue after re-burning the BFB, you can contact networking-support@nvidia.com

Thank you
M,S
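
The re-burn suggested above is usually done from the host by streaming a BFB image into the rshim boot node. The sketch below is one way that could look, under stated assumptions: the image path is a placeholder supplied by the caller, the boot node is assumed to be /dev/rshim0/boot, and push_bfb and DRY_RUN are hypothetical names introduced here. It defaults to a dry run so nothing is written by accident.

```shell
#!/bin/sh
# Sketch only: push_bfb and DRY_RUN are illustrative names, not NVIDIA tools.
# push_bfb IMAGE [BOOT_NODE]: stream a BFB image into the rshim boot node.
# With DRY_RUN=1 (the default) the command is printed instead of executed.
push_bfb() {
    image="${1:-}"
    boot_node="${2:-/dev/rshim0/boot}"
    if [ -z "$image" ] || [ ! -f "$image" ]; then
        echo "error: BFB image not found: ${image:-<none>}" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: cat $image > $boot_node"
        return 0
    fi
    # Writing the image to the boot node starts the reinstall on the card.
    cat "$image" > "$boot_node"
}

# Example (placeholder path; run as root and set DRY_RUN=0 to really flash):
#   DRY_RUN=0; push_bfb /path/to/image.bfb
```

Reflashing replaces the software on the Arm side, so anything installed there, including the CUDA packages, would need to be set up again afterwards.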

Hi M,S,

Thank you for your note and suggestions.

The CPU is reachable and comes up normally, but there is no minicom. Is there an alternative to minicom that I could use? …

Thank you,

Brandt
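
On the question of a minicom alternative: the rshim console node behaves like a serial device, so other terminal programs can generally open it. The sketch below is an assumption-laden illustration rather than an answer from the thread. It picks whichever of minicom or screen is installed on the host and prints the command to run; pick_console_cmd is a hypothetical helper name, and 115200 is the baud rate commonly used with this console.

```shell
#!/bin/sh
# Sketch only: pick_console_cmd is an illustrative helper, not an NVIDIA tool.
# pick_console_cmd [CONSOLE]: print a command that opens the rshim console,
# using the first terminal program found on the host.
pick_console_cmd() {
    console="${1:-/dev/rshim0/console}"
    if command -v minicom >/dev/null 2>&1; then
        echo "minicom -o -D $console"
    elif command -v screen >/dev/null 2>&1; then
        echo "screen $console 115200"
    else
        echo "none"
        return 1
    fi
}

# Print the suggestion; run the printed command with sudo to attach.
pick_console_cmd /dev/rshim0/console || true
```

If neither program is installed, installing one through the host's package manager is the simplest route.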
