HP z820 doesn't recognize K20c [Solved]

I have a dual-processor HP z820 with the latest BIOS, J63 v03.65. It’s running RHEL 6.5 with CUDA 5.5 and NVIDIA driver 331.49 installed. It has a K600 GPU for display in slot 6 and a K20c GPU for compute in slot 2. Both GPUs are recognized by lspci, and the nvidiainit.sh script creates /dev/nvidia0 and /dev/nvidia1 devices for them. But when I run deviceQuery, only the K600 is recognized; the K20c is not.

If I swap out the K20c for a T2075, then the K600 and T2075 are both recognized.

If I take the K20c to my z420, then that computer recognizes its Quadro 600 and the K20c.

Can you post the output of nvidia-smi with the K20c and K600 on the z820? It sounds like a bad power connection to the K20c on that machine.

% /sbin/lspci | grep -i nvidia
05:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K600] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev a1)

% nvidia-smi

Tue Mar 25 13:34:38 2014       
| NVIDIA-SMI 331.49     Driver Version: 331.49         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Quadro K600         Off  | 0000:06:00.0     Off |                  N/A |
| 25%   50C    P0    N/A /  N/A |      5MiB /  1023MiB |      0%      Default |
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|  No running compute processes found                                         |

% nvidia-smi -L
GPU 0: Quadro K600 (UUID: GPU-b919d4f4-a313-a963-13cb-b6715b4160e1)
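
The mismatch between the two listings above (two GPUs on the bus, one in the driver) can be checked mechanically. This is only a rough sketch, not something from the original posts: it parses text in the same format as the lspci and nvidia-smi -L output shown here and reports GPUs the bus sees but the driver does not. The function names are made up for illustration, and matching by product name is an assumption that would not tell two identical cards apart.

```python
import re

# lspci lines for NVIDIA display-class functions, e.g.
# "05:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)"
LSPCI_GPU = re.compile(
    r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) "
    r"(?:VGA compatible controller|3D controller): "
    r"NVIDIA Corporation .*?\[(.+?)\]",
    re.IGNORECASE,
)
# "nvidia-smi -L" lines, e.g. "GPU 0: Quadro K600 (UUID: GPU-...)"
SMI_GPU = re.compile(r"^GPU \d+: (.+?) \(UUID: ")

# Sample text copied from the output in this thread.
LSPCI_SAMPLE = """\
05:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K600] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev a1)
"""
SMI_SAMPLE = "GPU 0: Quadro K600 (UUID: GPU-b919d4f4-a313-a963-13cb-b6715b4160e1)\n"


def gpus_from_lspci(text):
    """Return {bus_address: product_name} for NVIDIA GPUs listed by lspci."""
    found = {}
    for line in text.splitlines():
        m = LSPCI_GPU.match(line.strip())
        if m:
            found[m.group(1)] = m.group(2)
    return found


def gpus_from_smi(text):
    """Return the product names listed by 'nvidia-smi -L'."""
    names = []
    for line in text.splitlines():
        m = SMI_GPU.match(line.strip())
        if m:
            names.append(m.group(1))
    return names


def missing_from_driver(lspci_text, smi_text):
    """GPUs present on the PCI bus whose name the driver does not list."""
    smi_names = gpus_from_smi(smi_text)
    return {
        addr: name
        for addr, name in gpus_from_lspci(lspci_text).items()
        # Lenient name match (assumption): substring in either direction.
        if not any(s in name or name in s for s in smi_names)
    }


if __name__ == "__main__":
    print(missing_from_driver(LSPCI_SAMPLE, SMI_SAMPLE))
    # prints {'05:00.0': 'Tesla K20c'}
```

On a live system the two sample strings would be replaced with the captured output of /sbin/lspci and nvidia-smi -L.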

The K20c has two 6-pin power cables plugged into it; those same two plugs fed the C2070 okay.

Try swapping the K20c with the K600 (slot-wise) and see if you get any difference. Not sure what the culprit would be otherwise…

I tried various combinations of slots; none of them worked.

  • K600 slot 1 ; K20c slot 2
  • K600 slot 6 ; K20c slot 2
  • K600 slot 6 ; K20c slot 4 (z820 has the dual-CPU option)
  • K600 slot 2 ; K20c slot 4

HP has a video BIOS update for the K20c: http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/advisoriesDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03802326-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=5225037&ac.admitted=1396287700914.876444892.492883150 It addresses the following:

  • This update addresses low voltage level interpretation as “device not present” in the Blade G8 workstation
  • This update allows 2 x K20s + K5000 in Z820 without having thermal issues

I’m a little leery of flashing a new video BIOS on a $3,000 GPU card, since the update doesn’t seem to directly address my situation. I don’t want to brick it by accident.

There’s no way to get a double-width K20c in slot 6 of the z820. The various cables that plug into the bottom of the motherboard just don’t leave enough room.

I got a DMM and measured the three 6-pin PCI-e power connectors, and they all showed +12V. So power looks good.

Have you tried going through your hardware support channels (e.g. HP directly)? It could very well be that updating the video BIOS will fix the issue; the description almost looks like what you are experiencing. That being said, it’s for a different model system.

Edit: They also advocate that update for your system just for ‘cooling’:

If the video BIOS on your card is older, give it a shot.

I got a second level support person from HP assigned to this issue. I’ll update this thread with any information I learn.

It fails for me with RHEL 6.4 (2.6.32-358) and RHEL 6.5 (2.6.32-431.5.1) on my z820. It doesn’t fail for me with Ubuntu 12.04 on my z820. Since it works for me with Ubuntu, it shows the K20 is good and doesn’t need the video BIOS upgrade, and the hardware in my z820 is good.

It doesn’t fail when HP tried it with RHEL 6.5 (unknown patch level) on their z820, cards in the same slots. So there’s something subtly different between my RHEL 6.5 setup and HP’s RHEL 6.5 setup. I can’t run Ubuntu at work, so I have to spend some more time figuring out what’s going on with RHEL 6.5.

What about the RedHat 7 beta? Is that work approved? Does RHEL 6.4 or 6.5 have an updated kernel? I’m assuming these are legacy BIOS/CSM installs… Was Ubuntu a UEFI or CSM install? Those are my ideas so far… but I suspect it might be kernel related… 2.6 kernels are gosh… very old now.

Problem solved: blacklist nvidiafb. Add this to blacklist.conf and the kernel boot line, and the K20c and K600 are both recognized by deviceQuery, and my GPU application works as well. Strange that nvidiafb didn’t interfere with the K600 or the C2070, but perhaps the lack of a video output port on the K20c has something to do with it.
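
For anyone wanting to confirm that both halves of the fix are in place, here is a small sketch that checks a modprobe config for a `blacklist nvidiafb` line and the kernel command line for a matching blacklist parameter. Some assumptions to flag: the post does not give the file path or the exact boot parameter, so /etc/modprobe.d/blacklist.conf and `rdblacklist=nvidiafb` (the RHEL 6 dracut form) are my guesses, and the helper names are hypothetical.

```python
def is_blacklisted(conf_text, module):
    """True if the config has an uncommented 'blacklist <module>' line."""
    for line in conf_text.splitlines():
        # Drop anything after '#', then compare the remaining fields.
        fields = line.split("#", 1)[0].split()
        if fields == ["blacklist", module]:
            return True
    return False


def cmdline_blacklists(cmdline_text, module):
    """True if a boot parameter ending in 'blacklist' names the module.

    Accepts e.g. rdblacklist=nvidiafb or modprobe.blacklist=a,nvidiafb.
    Which parameter the original poster used is not stated (assumption).
    """
    for token in cmdline_text.split():
        key, _, value = token.partition("=")
        if key.endswith("blacklist") and module in value.split(","):
            return True
    return False


def read_or_empty(path):
    """Return the file's text, or an empty string if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""


if __name__ == "__main__":
    module = "nvidiafb"
    # Assumed location; the thread only says "blacklist.conf".
    conf = read_or_empty("/etc/modprobe.d/blacklist.conf")
    cmdline = read_or_empty("/proc/cmdline")
    print("blacklist.conf entry:", is_blacklisted(conf, module))
    print("kernel boot line entry:", cmdline_blacklists(cmdline, module))
```

If either check prints False after a reboot, nvidiafb may still be getting loaded, which `lsmod | grep nvidiafb` would show.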

Vacaloca, thanks for your suggestions on this thread. Thanks also to Ian at HP who got me the output of lsmod with his working configuration, which pointed me in the right direction.
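
Comparing lsmod output from a working and a failing machine is easy to automate. A minimal sketch follows, assuming both outputs have been saved as text. The two sample listings are invented for illustration (the thread never shows the real lsmod output), and the function names are hypothetical.

```python
def module_names(lsmod_text):
    """Return the set of module names in 'lsmod' output, skipping the header."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def only_in_failing(working_text, failing_text):
    """Modules loaded on the failing machine but not on the working one."""
    return sorted(module_names(failing_text) - module_names(working_text))


# Invented sample data, NOT the real output from either machine.
WORKING = """\
Module                  Size  Used by
nvidia               1000000  0
i2c_core               30000  1 nvidia
"""
FAILING = """\
Module                  Size  Used by
nvidia               1000000  0
nvidiafb               40000  0
i2c_core               30000  2 nvidia,nvidiafb
"""

if __name__ == "__main__":
    print(only_in_failing(WORKING, FAILING))
    # prints ['nvidiafb']
```

Anything in the resulting list is a candidate for blacklisting, one module at a time.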

P.S.: RHEL 7 beta? That’s not going to happen at work; we’ll probably wait for RHEL 7.1 before making it available internally so someone else gets to shake those bugs out :-)