Tesla P4 - HP DL385 Gen10 - ESXi 6.5 showing no RAM available on Tesla P4

Hi there. We have 3 x Tesla P4 cards we want to use on ESXi 6.5 for vSGA. I installed the first two in HP DL380 Gen9 servers with Intel CPUs, installed the NVIDIA-VMware-418.196-1OEM.650.0.0.4598673.x86_64.vib file, and everything is working fine on those two servers. We are running ESXi 6.5.0 build 17477841 on all 3 hosts.
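For anyone following along, the install on each host was essentially the standard VIB install, something like this (the datastore path is just an example, and the host was rebooted afterwards):

esxcli system maintenanceMode set --enable true
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-VMware-418.196-1OEM.650.0.0.4598673.x86_64.vib
reboot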

The issue I have is with the 3rd host, which is an HP DL385 Gen10 with AMD CPUs. I've tried installing the same driver and it shows up in vCenter. It shows the card, it shows the active type as 'shared' and the configured type as 'shared', but it shows the memory as 0.00 MB.

Also, if I run nvidia-smi it says no devices were found, even though the BIOS and ESXi show the card is there. So it's obviously something to do with the drivers not loading, but I don't know why.
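In case it helps, these are the sort of checks I've been running on the host to see whether the card and the driver module are actually visible (the grep patterns are just what I'd expect to match, so adjust as needed):

lspci | grep -i nvidia                        # is the P4 visible on the PCI bus?
esxcli system module list | grep -i nvidia    # has the NVIDIA vmkernel module been loaded?
nvidia-smi                                    # should list the P4 once the driver initialises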

I’m clearly missing something here.

I've also tried updating the driver to the latest version I could download, NVIDIA-VMware-460.73.02-1OEM.650.0.0.4598673.x86_64.vib, but it hasn't changed anything.
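The update itself was just a VIB update and a reboot, something like this (again, the datastore path is only an example):

esxcli software vib update -v /vmfs/volumes/datastore1/NVIDIA-VMware-460.73.02-1OEM.650.0.0.4598673.x86_64.vib
reboot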

I’m losing my mind with this one…any help is very much appreciated.

Can you run nvidia-smi on the ESX host after installing the VIB? What is the output? It seems the GPU is not recognized properly. Maybe you need to check the BIOS settings first.

Yes, I have run it, as mentioned in my original post. It just says 'no devices were found', but the card is clearly detected by the hardware, as it's shown in iLO, the BIOS and ESXi…

That’s why I’m confused.

Check with dmesg on the host whether there are errors. I still assume a wrong BIOS setting.
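For example, something like this should surface the driver messages (on ESXi the same messages also end up in the persistent vmkernel log):

dmesg | grep -i nvrm                    # NVIDIA kernel driver (NVRM) messages
grep -i NVRM /var/log/vmkernel.log      # same messages in the persistent log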

There's a whole load of stuff in there when I run dmesg… not sure what I should be looking for, but this bit seems GPU-related:

2021-07-06T09:54:40.203Z cpu44:72073)NVRM: GPU at 0000:23:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
2021-07-06T09:54:40.217Z cpu44:72073)NVRM: GPU 0000:23:00.0: RmInitAdapter failed! (0x26:0xffff:1290)

Any idea what sort of settings in the BIOS I should be looking for? I checked SR-IOV in the virtualisation section and that's enabled. Well, it's greyed out so I can't select it, but it says enabled in grey.

As you can see, the board is not being initialised properly. It might be BIOS related or a hardware defect. Please check with HPE first for the right BIOS settings; the MMIO settings in particular are relevant. The P4 doesn't require SR-IOV to be enabled. In addition, I'm not sure the P4 is qualified for that server at all; as far as I can see, only the T4 was validated.
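As a quick sanity check you can also list the PCI devices the host has enumerated and confirm the entry at 0000:23:00.0 (the P4 from your log) shows up with the expected NVIDIA vendor and device IDs:

esxcli hardware pci list        # look for the device at 0000:23:00.0 and check its details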

Yeah, I figured something like that was going on, but I've been through all of the settings in the BIOS and there's nothing at all that I can see that's related. I just assumed that if they worked fine with the Gen9 of the same server, the Gen10 would be fine… I know that's not always the case, but I wouldn't have expected it not to work like this. I wasn't sure whether I needed to get advice from HP, NVIDIA or VMware, so I'll see what I can get from HP.

Thanks for the advice.

OK, so annoyingly HP says it's not supported on this server, which is really dumb… and frustrating. But at least if someone else is looking for this information, it's here now.

Thanks for your help sschaber

OK, so I have a bit of a wrinkle in this story. Today I removed the card from the HP DL385 Gen10 and put it into a Cisco UCS C210 M2 server we had, to see if it worked there, and got exactly the same issue… everything looks fine in the sense that the BIOS reports it and VMware sees it, but it says the card has 0 MB of RAM.

nvidia-smi reports no devices available.

It seems a bit too coincidental that both servers are doing the same thing… I know I have read that some of the Tesla cards can be switched between modes, but I'm not sure if the Tesla P4s are like this and whether maybe the card is in the wrong mode? I haven't been able to find any information about that for the P4, but I thought I'd ask in case I'm missing something there and fighting a losing battle if the card isn't going to respond properly.

Any advice is appreciated, as always. I've got some other, later-model Cisco servers I'm going to try the card in to see if I get similar results.

Have you tried ESXi 6.7 instead? Why are you still using 6.5? A mode switch is not possible on the P4, as it handles graphics and compute in parallel.
Did you open a support ticket with our NVES? They could analyze the nvidia-bug-report to see if it points to the issue.
And please keep the "old" 418.x driver for your testing, as I doubt the latest one works with 6.5 due to the extended VIB size. You even need a current patch for 6.7 to extend the accepted VIB size, and VMware didn't release such a patch for 6.5 as far as I know.
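To double-check which driver is actually installed, and to collect data for a support ticket, something along these lines should work on the host (assuming the bug-report script is included in the VIB you installed):

esxcli software vib list | grep -i nvidia    # confirm which NVIDIA VIB and version is installed
nvidia-bug-report.sh                         # generates an nvidia-bug-report archive that support can analyze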

regards
Simon

Hi Simon,

I used ESXi 6.5 because I understood it was the only version that officially supported the vSGA function without some kind of licensing. As I said, we have two other servers with the same card running the same driver version, but they are DL380 Gen9, not DL385 Gen10, so it's hard to know if it's a card issue or something else, as I'm not familiar enough with these cards and ESXi to diagnose it.

I haven't done anything else other than post here, as I wasn't aware of any other options, sorry.

Thanks for the info on the mode switch. I did assume it couldn't be switched, but thought maybe that was a reason the card might not seem to be working properly. At least if we know it's not possible, then I know it's not that!

I am going to use another server to do some more trialling with different drivers, ESXi versions, etc. to see if I get any different results. I couldn't really do too much on our production servers, but now I have some other servers to prove whether the card is or isn't working properly. That's my first step, I guess.

Even when I installed the latest driver I had, ESXi didn't complain and I thought it said it was installed successfully, but I'd better read the output again just to be super sure.

Unfortunately you are wrong. The ESX version is not relevant for licensing. You always need a vPC license for vSGA as soon as you use a GPU like the P4.

Thanks for letting me know. I found the licensing model super complicated when I tried to look up what was needed. I found plenty of posts elsewhere from very confused IT staff too, so it was obvious it wasn't just me struggling to understand it.

I found a driver for ESXi 6.5 from before NVIDIA had their vPC licensing, so that's why I assumed it didn't need any kind of licensing for that version. I know I needed some kind of licensing for the later versions (which I ended up needing to install to get things going on the other servers), so once we get all three going we will get whatever we need to make us legal.

So I haven't got anywhere with running the card in separate servers or on separate OSes.

I'm just wondering if it's worth me attempting a BIOS/firmware flash to make sure it's not something like that? Or is there some process we can use to diagnose further?

I can find a BIOS update, but it's packaged for SUSE or some other Linux, so I wondered if there's an easier method than setting all that up just to flash the card?

It’s really frustrating…

Do you mean flashing the GPU? That doesn't make any sense, as it's never necessary.

regards
Simon

Yes. I know it's not normally needed, but obviously in this situation it's not working as expected, so I just thought maybe it could be something like that, since it shows up in the BIOS and in ESXi but the drivers won't initialise in ESXi or Windows… so I guess I was clutching at straws a bit.

How can I diagnose further what might be happening?

Just wanted to let everyone who might be having similar issues know that this was down to a faulty card.

I went through everything over and over, and have had working cards in other servers with no dramas, so I knew what it should be doing.

Finally, after exhausting all options, I contacted the supplier and got a return/replacement, and the replacement worked straight away, so clearly it was some kind of fault. The odd thing is that the faulty card still showed up in the BIOS etc. and partially in VMware, as mentioned… but now I know it was a faulty card.

Thanks to everyone who replied.