Rshim[17469]: another backend already attached

Hi,

I’m trying to re-image the DPU using the bfb-install utility, which ships by default as part of the rshim driver package on CentOS. However, the rshim driver reports the error “another backend already attached”.

Apr 26 04:33:07 5a9s9-node4 rshim[17469]: Probing pcie-0000:81:00.2

Apr 26 04:33:07 rshim[17469]: create rshim pcie-0000:81:00.2

Apr 26 04:33:08 rshim[17469]: another backend already attached

Apr 26 04:33:08 rshim[17469]: USB device detected

As a result, neither the rshim device entry (/dev/rshim0) nor the network interface (tmfifo_net0) for the DPU is created on the host machine. Without these, we cannot access the DPU from the host.
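
For anyone trying to confirm the same symptom, a minimal sketch of a host-side check is below. It only tests whether the paths exist. The function name, the `root` parameter (there so the check can be pointed at a test directory), and the assumption that the interface is named tmfifo_net<N> for rshim<N> are illustrative and not taken from the rshim package.

```python
import os


def rshim_visibility(index=0, root="/"):
    """Report whether the rshim device entry and tmfifo interface exist.

    Assumes the device entry is <root>/dev/rshim<index> (a directory that
    holds a 'boot' node) and the interface is tmfifo_net<index>.
    """
    dev = os.path.join(root, "dev", f"rshim{index}")
    net = os.path.join(root, "sys", "class", "net", f"tmfifo_net{index}")
    return {
        "device_dir": os.path.isdir(dev),
        "boot_node": os.path.exists(os.path.join(dev, "boot")),
        "tmfifo_net": os.path.exists(net),
    }


if __name__ == "__main__":
    for name, present in rshim_visibility().items():
        print(f"{name}: {'present' if present else 'missing'}")
```

In the failure described here, all three entries would be expected to show as missing.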

One important observation: the re-image using bfb-install got stuck for 5+ hours, and to get out of the stuck bfb-install command I rebooted the host machine.

I have tried to solve this issue by following the NVIDIA troubleshooting guide and the workarounds suggested in this forum for other users, but the suggested fixes didn’t solve the issue.

So I started looking into the rshim driver source code to understand when the driver reports this error.

From the rshim driver code here rshim/rshim.c at 6dc1c010e809ab744dc6f387e6804ad61498d9c9 · Mellanox/rshim · GitHub , the reason for this error is that one of the rshim registers, ‘RSH_SCRATCHPAD1’, holds a magic value indicating that some other rshim backend (probably usb or pcie_live_fish) is holding access to the DPU.
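
To make that reading concrete, here is a toy model of the ownership check as described above. It is not the driver's code: the class, the function names, and the magic constant are all hypothetical placeholders, and the real value and logic live in rshim.c. It only illustrates why a marker left behind by a session that never cleaned up would make every later attach attempt fail.

```python
# Hypothetical placeholder value; the real constant is defined in the rshim source.
BACKEND_ATTACHED_MAGIC = 0xDEADBEEF


class FakeRshim:
    """Stand-in for the DPU-side register state (illustration only)."""

    def __init__(self):
        self.scratchpad1 = 0


def try_attach(dev, backend):
    """Refuse to attach if the scratchpad already carries the ownership marker."""
    if dev.scratchpad1 == BACKEND_ATTACHED_MAGIC:
        return False, "another backend already attached"
    dev.scratchpad1 = BACKEND_ATTACHED_MAGIC
    return True, f"{backend} attached"


def detach(dev):
    """Clear the marker, as a clean shutdown would in this model."""
    dev.scratchpad1 = 0


if __name__ == "__main__":
    dev = FakeRshim()
    print(try_attach(dev, "pcie"))  # first attach claims the device
    # No detach() here: models a session that ended without cleanup.
    print(try_attach(dev, "pcie"))  # rejected because the marker is still set
```

In this model the second attach is rejected even though no other backend is actually running, which is the situation suspected after the hung bfb-install.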

I suspected the usb backend could be the other backend that competed with pcie for access to the DPU and overwrote the SCRATCHPAD1 register on the DPU. So I explicitly disabled the usb and pcie_live_fish backends by adding rshim configuration at ‘/etc/modprobe.d/rshim.conf’. But disabling the other backends does not reset the register either; rshim still reports the same error and does not proceed to device registration.

A few other things I have tried without success:

  1. Ran the rshim driver in the foreground in debug mode, explicitly passing the backend device, as below.

[root]# rshim -b pcie -d pcie-0000:81:00.2 -i 0 -l 4 -f

Probing pcie-0000:81:00.2

create rshim pcie-0000:81:00.2

another backend already attached

I see the same error here also.

  2. I restarted MST (Mellanox Software Tools) and tried to reset nvc config using the Mellanox config tool.
    The intention behind this step was to create the network interface tmfifo_net0, but that didn’t succeed because it requires the rshim device entry as a prerequisite.

Any help appreciated.

Thanks,
Ganesh

Hello,

Can you post the output of “systemctl status rshim”?
Any change if you do a cold reboot (power cycle the server)?
Are you connected via USB or PCIe?
Was it working initially as expected?
Did this issue occur upon initiating a bfb-install?

Sophie.

root@5a9s9-node4:~# systemctl status rshim
● rshim.service - rshim driver for BlueField SoC
Loaded: loaded (/lib/systemd/system/rshim.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-05-13 16:17:18 PDT; 1 weeks 2 days ago
Docs: man:rshim(8)
Process: 2184 ExecStart=/usr/sbin/rshim $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2201 (rshim)
Tasks: 7 (limit: 309093)
Memory: 3.2M
CGroup: /system.slice/rshim.service
└─2201 /usr/sbin/rshim

May 13 16:17:18 5a9s9-node4 systemd[1]: Started rshim driver for BlueField SoC.
May 13 16:17:18 5a9s9-node4 rshim[2201]: Probing pcie-0000:81:00.2(uio)
May 13 16:17:18 5a9s9-node4 rshim[2201]: Create rshim pcie-0000:81:00.2
May 13 16:17:18 5a9s9-node4 rshim[2201]: rshim pcie-0000:81:00.2 enable
May 13 16:17:19 5a9s9-node4 rshim[2201]: rshim0 attached
May 13 16:17:19 5a9s9-node4 rshim[2201]: USB device detected
May 13 16:17:19 5a9s9-node4 rshim[2201]: Probing usb-3.b
May 13 16:17:19 5a9s9-node4 rshim[2201]: create rshim usb-3.b
May 13 16:17:19 5a9s9-node4 rshim[2201]: another backend already attached
May 13 16:17:19 5a9s9-node4 rshim[2201]: rshim usb-3.b deleted
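
Note that the two logs in this thread differ in which backend is rejected: in the original post the pcie device gets “another backend already attached”, while in the status output above pcie attaches as rshim0 and only the usb probe is rejected. A small sketch for pulling that out of journal text is below. The function name and the matching rules are assumptions based only on the message formats visible in this thread.

```python
import re

# Matches the message part of lines like "... rshim[2201]: create rshim usb-3.b"
LINE_RE = re.compile(r"rshim\[\d+\]: (?P<msg>.*)$")


def classify_backends(log_text):
    """Return (attached, rejected) lists of backend device names."""
    current = None
    attached, rejected = [], []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        msg = match.group("msg").strip()
        low = msg.lower()
        if low.startswith("create rshim "):
            # Remember which device the following messages refer to.
            current = msg.split()[2]
        elif low == "another backend already attached":
            if current:
                rejected.append(current)
        elif low.startswith("rshim") and low.endswith(" attached"):
            if current:
                attached.append(current)
    return attached, rejected
```

Fed the status output above, this reports pcie-0000:81:00.2 as attached and usb-3.b as rejected; fed the log from the original post, it reports pcie-0000:81:00.2 as rejected and nothing attached.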

Any change if you do a cold reboot (power cycle the server)?
==> No change.

Are you connected via USB or PCIe?
==> PCIe

Was it working initially as expected?
==> Yes

Did this issue occur upon initiating a bfb-install?
==> Yes. The bfb-install command to re-image the DPU got stuck for 5+ hours w/o any progress, so to get out of this state I rebooted the host machine. After the reboot, I started to see this issue.

Thanks,
Ganesh