Dell PE R730xd Fans running too fast because of ConnectX-3 PCIe card

Hello,

We recently bought a Dell PowerEdge R730xd with a Mellanox ConnectX-3 PCIe card inside (MCX353A-FCBT). Without any load, each of the 6 server fans runs at about 17000 RPM!

When I remove the ConnectX-3 card, the fan speed drops to ~5000 RPM, which seems more normal given what we have always observed on other compute nodes (even ones with ConnectX-3 cards).

I contacted Dell support, which eventually told me this is normal because they are not able to fine-tune the cooling response for such cards, or something along those lines. The main argument I found is given in the Dell article “Dell PowerEdge Server R7XX Series Fan Speed with GPU”, with this comment from Dell: “[…] Dell PowerEdge servers have the cooling capacity to support a broad array of PCI adapter cards. For PCI cards that are designed or qualified by Dell, the response is optimized while for third-party installed cards the response cautions on the side of more cooling. Many of these third-party cards do not have active thermal sensor monitoring or standard sensor reading topologies hence limiting our ability to fine tune cooling response for such cards.”

As far as I understand, the ConnectX-3 firmware has provided temperature sensors since version 2.40.5030, am I right? I have version 2.40.5048, so those temperature sensors should be available, right?

My first questions are:

  • How can I get the values of those sensors? Through IPMI?

  • Can we get the current temperature this way?

I also installed MFT to get the current temperature of the ConnectX card; the value returned by mget_temp is 45 (with the fans at 17000 RPM). I guess this is in °C, right?

What is the normal value I can expect? What is the maximum value I should not exceed?

Thanks a lot for any help

Best regards

Pierre

Hmm, I have not used the Dell-provided cards, but it’s a shame to hear they have more compatibility issues with Dell servers than the standard Mellanox cards. I assume you have already verified the FW is up to date. Unfortunately, I do not have any other advice for you.

Hi Grant,

Back at the office, I have just run some tuning tests (that is, tried all the different system profiles). Nothing works… Each time, I felt like a jet was taking off in my office.

The only way I found to make the server quiet (fans at 5000 RPM) is to take the IB card out of the server. This Mellanox card is Dell-provided.

I also tested with an old ConnectX IB card taken from a Dell C6100: I get the same problem in the R730xd (but not in the C6100).

Pierre

Hi all,

As recommended, I contacted Dell support again. For Dell, there is no problem: with this kind of PCIe card, the fans must run at 100%… As a consolation prize, they proposed adding a warning message to their configurator for cases where such a card can cause excessive ventilation.

Anyway, thanks to all of you for the help.

Pierre

Hi Grant, I will double-check the BIOS settings as soon as I’m back at the office. Thanks for the suggestion. Pierre

Are those cards Dell-provided? I ran ConnectX-3 and now ConnectX-4 cards in a Dell R730xd. They were the standard Mellanox ones, not Dell-provided. I would assume that if they worked, the Dell-provided ones should also work. The only time I saw the fans run at 100% all of the time was when I set the system performance profile to max. Setting it to OS or System controlled worked. Disabling some of the C-states may also have caused the issue. I would double-check your BIOS settings; I believe that if you follow the Mellanox tuning guide to a T, you will end up with the fans at full blast.

Hi,

mget_temp is the utility for obtaining the device temperature. The typical range is 0-55 °C; for a specific card, you can check our website for the documentation. Here is an example for ConnectX-3: http://www.mellanox.com/related-docs/user_manuals/ConnectX-3_Ethernet_Single_and_Dual_QSFP+_Port_Adapter_Card_User_Manual.pdf
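
As a rough illustration of the check described above, here is a minimal Python sketch that runs mget_temp and compares the reading with the 0-55 °C range. It assumes that mget_temp prints a bare integer in °C (as the value 45 reported earlier in this thread suggests) and that the device is selected with -d. The helper names are made up for this example and are not part of MFT.

```python
import subprocess

# Operating range quoted in this thread for ConnectX-3, in degrees Celsius.
OPERATING_RANGE_C = (0, 55)


def parse_mget_temp(output: str) -> int:
    """Extract the reading from mget_temp output.

    Assumption: the first whitespace-separated token is an integer in Celsius.
    """
    return int(output.strip().split()[0])


def within_operating_range(temp_c: int, limits=OPERATING_RANGE_C) -> bool:
    """Return True if the reading lies inside the inclusive operating range."""
    low, high = limits
    return low <= temp_c <= high


def read_device_temp(device: str) -> int:
    """Run mget_temp against one MST device and return the parsed reading."""
    result = subprocess.run(
        ["mget_temp", "-d", device],
        capture_output=True,
        text=True,
        check=True,  # raise if mget_temp exits with an error
    )
    return parse_mget_temp(result.stdout)
```

To use it, pass the MST device name to read_device_temp, for example /dev/mst/mt4099_pci_cr0 (a typical ConnectX-3 name, assumed here; `mst status` lists the devices actually present after `mst start`), then feed the result to within_operating_range.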

Thanks, Alkx, for your answer. As you suggested, I checked the range in the documentation for my card (Mellanox Products: ConnectX®-3 Single/Dual-Port Adapter with VPI http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi ); you’re right, the (operational) range is 0-55°C.

If the utility “mget_temp” returns the device temperature, it means that there is a sensor for that, right? In that case, I am surprised that Dell didn’t make use of this sensor to regulate the cooling of the server.

With the 6 server fans at 17000 RPM (which means 100%), the temperature returned by mget_temp is 45°C. I’m curious to know what the device temperature would be if the server fans ran slower (e.g. 5000 RPM). Unfortunately, I’m not able to deactivate the over-cooling triggered by this additional PCI card on the server; it’s always on. If I were able to activate/deactivate this over-cooling, I could regulate it myself according to the device temperature.

Anyway, thanks again, Alkx, for your answer.
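
To make the idea of regulating the fans from the device temperature concrete, here is a minimal Python sketch of a linear temperature-to-duty mapping. Everything in it is hypothetical: the 35 °C starting point and the 30% floor (roughly 5000 of 17000 RPM) are arbitrary illustration values, and read_temp and set_duty are placeholder callbacks. As described in this thread, the server does not expose such a control here, so this only shows the logic.

```python
def fan_duty_percent(temp_c: float,
                     idle_c: float = 35.0,   # assumed: below this, stay at min_duty
                     max_c: float = 55.0,    # upper end of the quoted operating range
                     min_duty: int = 30,     # assumed floor, about 5000 of 17000 RPM
                     max_duty: int = 100) -> int:
    """Map a temperature to a fan duty cycle with a linear ramp."""
    if temp_c <= idle_c:
        return min_duty
    if temp_c >= max_c:
        return max_duty
    fraction = (temp_c - idle_c) / (max_c - idle_c)
    return round(min_duty + fraction * (max_duty - min_duty))


def regulate_once(read_temp, set_duty) -> int:
    """Run one control step: read the temperature, apply the computed duty.

    read_temp and set_duty are hypothetical callbacks supplied by the caller.
    """
    duty = fan_duty_percent(read_temp())
    set_duty(duty)
    return duty
```

A real controller would call regulate_once periodically and would probably add hysteresis so the fans do not oscillate around a threshold.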

Best regards

Pierre

Hi Grant,

Yes, I checked the FW as well. Anyway, thanks for your help and advice. I will contact Dell support again to escalate the issue.

Pierre, please open a case and escalate via Dell support. AFAIK, if the card is indeed Dell-branded, it should have integration for fan control with the Dell server, as well as many other things. That’s one of the value-adds of Dell (and other server OEM) branded solutions.

I have the same issue with my R730xd: I put in a Mellanox ConnectX-3 card and the R730xd fans go to 98%. Did Dell support give you anything to alleviate this?

  1. Have you taken it up with Dell, as suggested earlier in this thread?
    They did reply to the user who raised this case on the forum.

  2. From the following link:

https://mymellanox.force.com/mellanoxcommunity/s/question/0D51T00006RVuctSAD/dell-pe-r730xd-fans-running-too-fast-because-of-connectx3-pcie-card

There you will see the resolution of this issue: the customer reports that Dell considers this fan behavior normal with PCIe cards such as the ConnectX-3 on the R730xd platform.
Here’s the relevant excerpt (the last post is the relevant one):

[//cdck-file-uploads-global.s3.dualstack.us-west-2.amazonaws.com/nvidia/original/3X/2/b/2b9425e704e3616888f6cf3bc21a89b08fd2d5ec.png]

Hi TomP

Indeed, this is not considered a problem by Dell, and the server has
been running like this for over 5 years.

Best regards

Pierre