Hi,
I’m trying to re-image the DPU using the bfb-install utility, which ships by default as part of the rshim driver package on CentOS. However, the rshim driver reports the error “another backend already attached”.
Apr 26 04:33:07 5a9s9-node4 rshim[17469]: Probing pcie-0000:81:00.2
Apr 26 04:33:07 rshim[17469]: create rshim pcie-0000:81:00.2
Apr 26 04:33:08 rshim[17469]: another backend already attached
Apr 26 04:33:08 rshim[17469]: USB device detected
As a result, neither the rshim device entry (/dev/rshim0) nor the network interface (tmfifo_net0) for the DPU is created on the host machine. Without these, the DPU cannot be accessed from the host.
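For anyone trying to reproduce this, a quick check along these lines can show whether both entries exist on the host. This is only a minimal sketch: the function name check_rshim_endpoints is illustrative, and it assumes the network interface would show up under /sys/class/net as on a typical Linux host.

```shell
#!/bin/sh
# Report whether the host-side rshim endpoints exist.
# The two arguments are optional overrides (hypothetical convenience for
# trying the function out); the defaults are the names from this post.
check_rshim_endpoints() {
    dev="${1:-/dev/rshim0}"
    netdir="${2:-/sys/class/net/tmfifo_net0}"
    rc=0
    for path in "$dev" "$netdir"; do
        if [ -e "$path" ]; then
            echo "present: $path"
        else
            echo "missing: $path"
            rc=1
        fi
    done
    return "$rc"
}

# Returns non-zero if either entry is absent.
check_rshim_endpoints || echo "rshim endpoints incomplete"
```

On the affected host both paths would be expected to be reported as missing.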
One important observation: during re-imaging, bfb-install got stuck for 5+ hours, and to get out of the stuck bfb-install command I had to reboot the host machine.
I have tried to solve this issue by following the NVIDIA troubleshooting guide and the workarounds suggested in this forum for other users, but none of the suggested fixes solved the issue.
So I started looking into the rshim driver source code to understand when the driver reports this error.
From the rshim driver code (rshim/rshim.c at 6dc1c010e809ab744dc6f387e6804ad61498d9c9 · Mellanox/rshim · GitHub), the error is reported when one of the rshim registers, ‘RSH_SCRATCHPAD1’, holds a magic value indicating that some other rshim backend (probably usb or pcie_live_fish) is holding access to the DPU.
I suspected that the usb backend could be that other backend, competing with pcie for access to the DPU and overwriting the SCRATCHPAD1 register on the DPU. So I explicitly disabled the usb and pcie_live_fish backends by adding rshim configuration at ‘/etc/modprobe.d/rshim.conf’. But disabling the other backends does not reset the register either; rshim still reports the same error and does not proceed to register the device.
A few other things I have tried without success:
- Ran the rshim driver in foreground mode with debug logging, explicitly passing the backend and device as below.
[root]# rshim -b pcie -d pcie-0000:81:00.2 -i 0 -l 4 -f
Probing pcie-0000:81:00.2
create rshim pcie-0000:81:00.2
another backend already attached
I see the same error here too.
- Restarted MST (Mellanox Software Tools) and tried to reset nvc config using the Mellanox config tool.
The intention behind this step was to create the network interface tmfifo_net0, but that didn’t succeed because it requires the rshim device entry as a prerequisite.
Any help is appreciated.
Thanks,
Ganesh