“No devices were found” NVIDIA cc with Intel TDX
Hi, I’m trying to get Intel TDX and H100 cc-mode to work together.
When setting up the cc environment, I encountered some issues related to the VBIOS version.
The minimum version required according to cc-deployment-guide-tdx.pdf is 96.00.5E.00.00, whereas mine is 96.00.30.00.01.
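To make the gap between the two versions concrete, here is a minimal Python sketch that compares them field by field. It assumes each dot-separated field is a hexadecimal number and that fields are compared left to right; that reading is suggested by the `5E` field but is not stated in the deployment guide excerpt quoted here. The function names `parse_vbios` and `meets_minimum` are made up for illustration.

```python
from typing import Tuple


def parse_vbios(version: str) -> Tuple[int, ...]:
    """Split a VBIOS string such as '96.00.5E.00.00' into integer fields.

    Assumption: every field is hexadecimal.
    """
    return tuple(int(field, 16) for field in version.strip().split("."))


def meets_minimum(current: str, minimum: str) -> bool:
    """Return True if `current` is at least `minimum`, comparing left to right."""
    return parse_vbios(current) >= parse_vbios(minimum)


MINIMUM = "96.00.5E.00.00"   # minimum version quoted from the deployment guide
CURRENT = "96.00.30.00.01"   # version reported for the card in this post

print(meets_minimum(CURRENT, MINIMUM))
```

Under that assumption the third field decides the comparison (0x30 is lower than 0x5E), so the installed VBIOS is older than the required minimum.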
a. Updating VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the BIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)
b. Apart from updating the BIOS, are there any other ways to solve this issue?
c. What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
I’d like to understand this in more detail.
d. If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?
Updating VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the BIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)
You need to contact the OEM who sold you the H100 and they can assist in the update. This is quite a safe operation when done with the proper tools.
Apart from updating the BIOS, are there any other ways to solve this issue?
(…)
What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
There is no other workaround. FSP errors listed above are likely due to unknown commands from your outdated VBIOS.
If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?
Unfortunately, no. You must update your VBIOS to enable confidential computing modes on H100.
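For anyone checking whether they are hitting the same failure, here is a small Python sketch that scans kernel log text for the NVRM message quoted in the question. The pattern is taken only from that one log line; the helper name `find_fsp_failures` is hypothetical, and a match shows only that the same message appeared, not why.

```python
import re

# Pattern based solely on the log line quoted in this thread.
FSP_PATTERN = re.compile(r"NVRM: .*FSP boot cmds failed")


def find_fsp_failures(log_text: str) -> list:
    """Return every line of `log_text` that contains the FSP boot failure message."""
    return [line.strip() for line in log_text.splitlines() if FSP_PATTERN.search(line)]


sample = (
    "[ 1209.300020] NVRM: kfspSendBootCommands_GH100: "
    "FSP boot cmds failed. RM cannot boot."
)
for hit in find_fsp_failures(sample):
    print(hit)
```

To use it on a real system, pass in saved kernel log text, for example the contents of a file written with `dmesg > dmesg.log`.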
I gave it a shot with both HPE and DELL tools to update the VBIOS on my H100 GPU, but it seems like both times I hit a snag—couldn’t get the BIOS info from the GPU.
# cat gpu.log
Current Information for Nvidia Devices:
Number of Nvidia GPUs present on System : 1
Device Information:
98:00.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
NVIDIA Firmware Information for all Devices:
# cat console.log
HPE Firmware- Video BIOS ( NVIDIA) flashing Utility Version 1.0 - output files saved in *.log
Gathering System information(Check system.log).......................Complete!
NVIDIA Firmware Information for all Devices:
Gathering Nvidia Device information(Check gpu.log).......................Complete!
Firmware not found in component for for index 0
Utility is not built for this device. This is for Graphics Device with Board ID 0x03B1 series of cards.
I have a question: Why would the H100 GPU have a VBIOS version that doesn’t support confidential computing features? Were the confidential computing features added later to the H100?
Why does an outdated VBIOS lead to the “No devices were found” error?
I’m trying to understand the connection between the VBIOS and the “No devices were found” error. Can anyone explain it, along with the detailed reasons for this error (the dmesg output is above)? Is it possible that it’s not a VBIOS issue?
Besides reaching out to the supplier, how can we fix this problem by ourselves?
Thank you very much!