Rebuilding OFED 1.5.3 for new Linux kernel

I administer an HP cluster of 3 head nodes and 60 compute nodes. All nodes run RHEL 6.2 x86_64 with kernel 2.6.32-220. All nodes have Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] NICs. The nodes are interconnected by a 1 Gb Ethernet network and an InfiniBand (IB) network. HP Professional Services installed RHEL with Mellanox OFED 1.5.3. The HP technician told me we had to use the Mellanox drivers because our Mellanox hardware wasn’t well supported by the in-tree Linux IB drivers. We get our hardware and software support, including for the Mellanox NICs, through HP. As of this writing, only one of the head nodes faces the outside world, so that off-campus users can log in to that login node using ‘ssh’.

So far I’ve held back from installing a lot of software updates, but I’m concerned about the security ramifications of running older patch levels, especially of the Linux kernel. A recently announced ‘zero-day’ exploit affecting RHEL 6.2 x86_64 has made me even more concerned! I’d like to bring my nodes more up to date, but I see that Mellanox OFED 1.5.3 specifically supports only kernel release 2.6.32-220 with RHEL 6.2, while Red Hat is currently offering release 2.6.32-358.6.2.
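
For context, this is roughly how I’m comparing what we’re running against what Red Hat currently offers (assuming a configured yum/RHN repo; the version strings in the comments are just examples):

    # kernel currently running on the nodes
    uname -r                                  # e.g. 2.6.32-220.el6.x86_64

    # kernel packages Red Hat currently offers
    yum --showduplicates list kernel | tail -n 3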

Questions:

  1. Are there any known issues with running mlnx_add_kernel_support.sh to build OFED RPMs for Red Hat-provided kernels newer than 2.6.32-220?

  2. If I do run into an issue, is there any way I can pursue getting help other than opening a ticket with HP?

  3. How could I figure out whether the native Linux IB drivers support my IB hardware? (A rough sketch of what I had in mind follows this list.)
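
For question 3, this is the sort of check I was considering on one node; I’m assuming the in-tree mlx4_core/mlx4_ib driver is the one that would claim ConnectX hardware, so please correct me if that assumption is wrong:

    # find the Mellanox HCA and its numeric PCI [vendor:device] ID (vendor 15b3 = Mellanox)
    lspci -nn | grep -i mellanox

    # list the PCI device IDs the in-tree mlx4_core driver claims to support
    modinfo mlx4_core | grep '^alias'

    # if the device ID reported by lspci appears in the alias list, the stock driver
    # at least recognizes the card (which doesn't by itself prove feature parity with OFED)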

My goal: maintain a stable cluster without falling too far behind on critical and security patches.

Thanks!

Dave

Hi,

We face a similar problem. The only way we’ve found to solve it is to run a pre-prod cluster: install the updates there and test them (it takes a few hours).

We use HP C7000 blade enclosures with HP BL460 and BL465 G7 servers and ConnectX-2 mezzanine cards, firmware version HP 2.7 (the latest is Mellanox 2.9, but we can’t get that to work on the HP mezzanine cards… yet).

The pre-prod cluster is two nodes of older BL460 G6 servers with minimal RAM and one CPU each, on a separate IB fabric, plus two more nodes running ESX and CentOS initiators (SRP).

Our target servers are currently on CentOS 6.3 (kernel 2.6.32) using OFED 1.5.3 with SCST 2.2, and it’s fairly stable. However, please note that our kernel is custom because of SCST requirements, which means upgrades aren’t easy: we have to rebuild the kernel for each individual machine.

We are considering trying the OFED 2 driver with SCST 3.0 on Ubuntu 12.10.

Two good resources are HOWTO: Infiniband SRP Target on CentOS 6 incl RPM SPEC | Andy’s Tech Blog

and TechnoNibbles: Installing SCST and SRP for RHEL (Centos/OEL) 6.2, building kernel etc.

Each shows the steps taken to rebuild the kernel. We are adopting the approach from Andy’s (the first) blog, since he rebuilds the RPMs: we can easily test them in pre-prod and then package up the RPMs to be installed on multiple nodes quickly, roughly along the lines sketched below.
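
(Not Andy’s exact commands, just the general pattern we follow once the kernel SPEC from his post is in place; the node names and paths below are placeholders.)

    # build the patched kernel RPMs once, on the pre-prod build host
    rpmbuild -bb ~/rpmbuild/SPECS/kernel.spec

    # push the resulting RPMs to the pre-prod nodes and install them there
    for n in preprod01 preprod02; do
        scp ~/rpmbuild/RPMS/x86_64/kernel-*.rpm "$n:/tmp/"
        ssh "$n" "yum -y localinstall /tmp/kernel-*.rpm && reboot"
    done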

Hope this helps!

Thanks! Your links are helpful. The idea of using a non-cluster system to build the OFED RPMs was especially useful. The comments appended to the first link have some interesting opinions regarding OFED and maintenance. Like one of the commenters, I too would be thrilled if the Linux kernel shipped with IB drivers for our hardware.

Starting with a RHEL 6.2 x86_64 VM, I ran ‘yum update kernel’ to get the latest kernel. After rebooting, I ran mlnx_add_kernel_support.sh with no reported errors. It built an ISO including kernel-ib, kernel-mft, and knem RPMs for the new kernel. This gives me some confidence that OFED will build OK on my cluster if/when I update the kernel there.
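
In case it helps anyone else, here is roughly what I did on the VM, plus the sanity check I used to convince myself the rebuilt kernel-ib RPM really targets the new kernel. The ISO and RPM names are from my build, and the path inside the ISO may differ on yours:

    # on the throwaway RHEL 6.2 x86_64 VM
    yum -y update kernel
    reboot

    # after the reboot, confirm the new kernel is running
    uname -r                    # e.g. 2.6.32-358.6.2.el6.x86_64

    # mount the ISO built by mlnx_add_kernel_support.sh and check that the rebuilt
    # kernel-ib RPM puts its modules under the new kernel's /lib/modules directory
    mount -o loop MLNX_OFED_LINUX-1.5.3-rhel6.2-x86_64-ext.iso /mnt
    rpm -qpl /mnt/RPMS/kernel-ib-*.rpm | grep "/lib/modules/$(uname -r)/" | head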

Testing-wise, I don’t have any spare IB NICs lying around, so my least-bad option will be to adopt a couple of compute nodes for a few hours. After that, I’d either have to plow ahead with upgrading the rest of the nodes or roll the updated nodes back from an HP CMU image. Testing on a production cluster adds to the challenge, I suppose.
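
For what it’s worth, these are the sorts of quick smoke tests I plan to run on the two adopted nodes once the rebuilt OFED is on them (ibstat/ibhosts come from infiniband-diags and ib_write_bw from perftest; the hostname below is a placeholder):

    # link and port state on each test node
    ibstat
    ibv_devinfo

    # confirm both nodes are visible on the fabric
    ibhosts

    # simple bandwidth check between the two adopted nodes
    ib_write_bw                 # on node A, starts the server side
    ib_write_bw nodeA           # on node B, runs the client against node A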

Dave