VBIOS update on H100 (Intel TDX + H100)

Hi, I’m trying to get Intel TDX and H100 cc-mode to work together.
Related thread: “No devices were found” NVIDIA cc with Intel TDX
When setting up the cc environment, I encountered some issues related to the VBIOS version.
The minimum version required according to cc-deployment-guide-tdx.pdf is 96.00.5E.00.00, whereas mine is 96.00.30.00.01.
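For anyone who wants to check their own card against the guide, here is a minimal sketch of the comparison. It assumes each dot-separated field of the VBIOS string is a hexadecimal number and that versions are ordered field by field from left to right; that ordering rule is my assumption, not something the deployment guide states. The helper names `parse_vbios` and `meets_minimum` are made up for illustration.

```python
def parse_vbios(version: str) -> tuple[int, ...]:
    """Split a dotted VBIOS string into integers, reading each field as hex (assumption)."""
    return tuple(int(field, 16) for field in version.strip().split("."))


def meets_minimum(current: str, minimum: str) -> bool:
    """Compare two versions field by field, left to right (assumed ordering)."""
    return parse_vbios(current) >= parse_vbios(minimum)


REQUIRED = "96.00.5E.00.00"   # minimum quoted from cc-deployment-guide-tdx.pdf
INSTALLED = "96.00.30.00.01"  # version reported for the card in this thread

print(meets_minimum(INSTALLED, REQUIRED))  # False: 0x30 is lower than 0x5E
```

Under that assumption the third field decides it: 0x30 is below 0x5E, so the installed VBIOS falls short of the minimum.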

[ 1204.121696] nvidia-persiste[11119]: segfault at 44 ip 00007e96bba08c21 sp 00007ffdbcd8f790 error 6 in libnvidia-cfg.so.550.90.07[7e96bba00000+4d000] likely on CPU 15 (core 15, socket 0)
[ 1204.121708] Code: 00 31 c0 48 81 c4 10 08 00 00 5b 5d 41 5c 41 5d 41 5e c3 66 0f 1f 44 00 00 41 55 41 54 48 8d 57 48 55 53 48 89 fb 48 83 ec 28 47 44 00 00 00 00 8b 77 08 8b 3f e8 8e fe ff ff 85 c0 89 c5 75
[ 1204.123427] [drm] [nvidia-drm] [GPU ID 0x00000016] Loading driver
[ 1204.123433] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:16.0 on minor 0
[ 1209.300006] NVRM: kfspPollForResponse_IMPL: FSP command timed out
[ 1209.300014] NVRM: kfspSendBootCommands_GH100: Sent following content to FSP:
[ 1209.300016] NVRM: kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x19b780000
[ 1209.300017] NVRM: kfspSendBootCommands_GH100: frtsSysmemOffset=0x0, frtsSysmemSize=0x0
[ 1209.300018] NVRM: kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[ 1209.300019] NVRM: kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x151229000
[ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[ 1209.300022] NVRM: kfspDumpDebugState_GH100: GPU 0000:00:16
[ 1209.300023] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x9f
[ 1209.300025] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x10fd12
[ 1209.300026] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x1103c0
[ 1209.300028] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x5
[ 1209.300859] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:756
[ 1209.301213] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 1209.945763] NVRM: GPU 0000:00:16.0: RmInitAdapter failed! (0x62:0x65:1784)
[ 1209.947410] nvidia-uvm: Loaded the UVM driver, major device number 236.
[ 1209.948222] NVRM: GPU 0000:00:16.0: rm_init_adapter failed, device minor number 0
[ 1209.955901] [drm] [nvidia-drm] [GPU ID 0x00000016] Unloading driver
[ 1209.971758] nvidia-modeset: Unloading

However, I still have some doubts:

a. Updating the VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the VBIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)

b. Apart from updating the BIOS, are there any other ways to solve this issue?

c. What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
I’d like to understand this in more detail.

d. If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?

Thanks!

Updating VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the BIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)

You need to contact the OEM who sold you the H100 and they can assist in the update. This is quite a safe operation when done with the proper tools.

Apart from updating the BIOS, are there any other ways to solve this issue?
(…)
What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.

There is no other workaround. FSP errors listed above are likely due to unknown commands from your outdated VBIOS.
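To check whether a given boot hit the same failure, a small sketch like the one below can pull the relevant lines out of saved `dmesg` output. The marker strings are copied from the log posted earlier in this thread; they are not an official or complete list of FSP error messages, and the function name `find_fsp_failures` is hypothetical.

```python
import re

# Substrings taken from the dmesg output in this thread (not an official list).
FSP_MARKERS = (
    "kfspPollForResponse_IMPL: FSP command timed out",
    "FSP boot cmds failed. RM cannot boot.",
    "RmInitAdapter failed!",
)

# Matches kernel log lines of the form "[ 1209.300006] message".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def find_fsp_failures(dmesg_text: str) -> list[tuple[float, str]]:
    """Return (timestamp, message) pairs for lines containing a known marker."""
    hits = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line)
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if any(marker in message for marker in FSP_MARKERS):
            hits.append((stamp, message))
    return hits


SAMPLE = """\
[ 1209.300006] NVRM: kfspPollForResponse_IMPL: FSP command timed out
[ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[ 1209.945763] NVRM: GPU 0000:00:16.0: RmInitAdapter failed! (0x62:0x65:1784)
[ 1209.947410] nvidia-uvm: Loaded the UVM driver, major device number 236.
"""

for stamp, message in find_fsp_failures(SAMPLE):
    print(f"{stamp:.6f}  {message}")
```

Finding these lines only shows that the FSP boot handshake timed out; it does not by itself prove the VBIOS is the cause, so it is best read together with the version check against the deployment guide.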

If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?

Unfortunately, there is no such method. You must update your VBIOS to enable confidential computing modes on the H100.

I was wondering why it’s necessary to contact the vendor to update the BIOS of an NVIDIA GPU, rather than reaching out directly to NVIDIA. Could you please clarify?
Also, could you please clarify whether the tool available on this website is officially endorsed? https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_27be6bdc0e3e47c697e54079e3 Thank you very much!

Some information that might be helpful: Does NVIDIA H100-SXM 0x2330 support cc? · Issue #67 · NVIDIA/nvtrust · GitHub

I gave it a shot with both the HPE and Dell tools to update the VBIOS on my H100 GPU, but both times I hit a snag: the tool couldn’t get the BIOS info from the GPU.

Any clue what’s up with that?

tools:

https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_8185a55ddd114fc59016b4563d&tab=Installation+Instructions

https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=dthfh

errors:

# cat gpu.log 
Current Information for Nvidia Devices:
Number of Nvidia GPUs present on System : 1

Device Information:
98:00.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
 
NVIDIA Firmware Information for all Devices:
 

# cat console.log 

HPE Firmware- Video BIOS ( NVIDIA) flashing Utility Version 1.0 - output files saved in *.log
Gathering System information(Check system.log).......................Complete!
NVIDIA Firmware Information for all Devices:
 
Gathering Nvidia Device information(Check gpu.log).......................Complete!
 
Firmware not found in component for  for index 0
 
Utility is not built for this device. This is for Graphics Device with Board ID 0x03B1 series of cards.

Thank you!

Thanks for the help! I know my H100 GPU’s VBIOS is really old, and right now, I’m having a tough time updating it.

I have a question: Why would the H100 GPU have a VBIOS version that doesn’t support confidential computing features? Were the confidential computing features added later to the H100?

Why does an outdated VBIOS lead to the “No devices were found” error?
I’m trying to understand the connection between the VBIOS and the “No devices were found” error, as well as the detailed reasons for this error occurring (the dmesg output is available above). Can anyone explain this for me? Is it possible that it’s not a VBIOS issue?
Besides reaching out to the supplier, is there any way to fix this problem ourselves?
Thank you very much!