Why is my NCCL broken?

@kim.dang check that your interfaces have transport set to InfiniBand and link_layer set to Ethernet.

For the interface that has an IP assigned (the one used for testing), run ibv_devinfo -d rocep1s0f0 and post the output. It should look like this:

elsaco@spark2:~$ ibv_devinfo -d rocep1s0f0
hca_id:	rocep1s0f0
	transport:			InfiniBand (0)
	fw_ver:				28.45.4028
	node_guid:			4cbb:4703:002d:a85d
	sys_image_guid:		4cbb:4703:002d:a85d
	vendor_id:			0x02c9
	vendor_part_id:		4129
	hw_ver:				0x0
	board_id:			NVD0000000087
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
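
If you want to check those two fields without reading the output by eye, here is a rough Python sketch. It assumes the `key: value` layout shown above; the helper names (`parse_ibv_devinfo`, `mismatches`, `check_device`) are made up for this example and are not part of any tool.

```python
import subprocess

# The two values this post says to look for.
EXPECTED = {"transport": "InfiniBand", "link_layer": "Ethernet"}


def parse_ibv_devinfo(text):
    """Collect 'key: value' pairs from ibv_devinfo output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so GUIDs such as 4cbb:4703:... stay whole.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def mismatches(fields):
    """Return {key: (expected, found)} for fields that differ from EXPECTED."""
    bad = {}
    for key, want in EXPECTED.items():
        got = fields.get(key, "")
        # ibv_devinfo prints e.g. "InfiniBand (0)", so compare the first word only.
        first = got.split()[0] if got else ""
        if first != want:
            bad[key] = (want, got)
    return bad


def check_device(device):
    """Run ibv_devinfo for one device and report mismatching fields."""
    result = subprocess.run(
        ["ibv_devinfo", "-d", device],
        capture_output=True, text=True, check=True,
    )
    return mismatches(parse_ibv_devinfo(result.stdout))
```

On the node you would call `check_device("rocep1s0f0")`; an empty dict means both fields match. It only looks at transport and link_layer and does not judge the port state.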

Also, why not use NCCL_SOCKET_IFNAME instead of NCCL_IB_HCA, like in the connection test playbook?

From "Environment Variables" in the NCCL 2.29.1 documentation:

The NCCL_SOCKET_IFNAME variable specifies which IP interfaces to use for communication.
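
As a rough sketch of what that could look like from a Python launcher: set the variable before NCCL initializes. The interface name below is a placeholder I made up, not one from this thread. Note that NCCL_SOCKET_IFNAME takes the network interface name as shown by `ip addr`, not the RDMA device name that ibv_devinfo prints. `nccl_env` is a hypothetical helper, and turning on NCCL_DEBUG=INFO is my own suggestion so the log shows what NCCL picked.

```python
import os


def nccl_env(socket_ifname, debug=True):
    """Build NCCL environment settings that pin the IP interface."""
    env = {"NCCL_SOCKET_IFNAME": socket_ifname}
    if debug:
        # NCCL_DEBUG=INFO makes NCCL log its setup, including interface selection.
        env["NCCL_DEBUG"] = "INFO"
    return env


# Placeholder name: replace with the interface that carries your test IP.
IFNAME = "enp1s0f0np0"

# Must run before the process initializes NCCL (for example, before
# creating a distributed process group), or the setting is not picked up.
os.environ.update(nccl_env(IFNAME))
```

The same two variables can be exported in the shell before launching the job instead.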