Hi, I’m trying to get Intel TDX and H100 cc-mode to work together. “No devices were found” NVIDIA cc with Intel TDX
When setting up the cc environment, I encountered some issues related to the VBIOS version.
The minimum version required according to cc-deployment-guide-tdx.pdf is 96.00.5E.00.00, whereas mine is 96.00.30.00.01.
a. Updating VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the BIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)
b. Apart from updating the BIOS, are there any other ways to solve this issue?
c. What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
I’d like to understand this in more detail.
d. If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?
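For anyone who wants to check their own card against the minimum quoted above, here is a minimal sketch of the comparison. It assumes each dot-separated field of the VBIOS string is a hexadecimal byte and that fields are ordered from most to least significant; the deployment guide does not spell this out, so treat it as an assumption. The helper names `parse_vbios` and `meets_minimum` are made up for illustration.

```python
def parse_vbios(version: str) -> tuple[int, ...]:
    """Split a dotted VBIOS string such as '96.00.5E.00.00' into integers.

    Assumption: every field is a hexadecimal byte, most significant first.
    """
    return tuple(int(field, 16) for field in version.strip().split("."))


def meets_minimum(current: str, minimum: str) -> bool:
    """Return True when `current` is at least `minimum`, compared field by field."""
    return parse_vbios(current) >= parse_vbios(minimum)


MIN_CC_VBIOS = "96.00.5E.00.00"  # minimum quoted from the deployment guide
installed = "96.00.30.00.01"     # version reported in this thread

print(meets_minimum(installed, MIN_CC_VBIOS))
```

When the driver does load (for example with CC disabled), the installed version string can usually be read with `nvidia-smi --query-gpu=vbios_version --format=csv,noheader`.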
Updating VBIOS seems quite risky, especially for expensive hardware like the H100. Is there a safe method to update the BIOS? (NVIDIA graphics cards differ from AMD graphics cards in not having a dual BIOS design.)
You need to contact the OEM who sold you the H100 and they can assist in the update. This is quite a safe operation when done with the proper tools.
Apart from updating the BIOS, are there any other ways to solve this issue?
(…)
What is the cause of this problem? Is it this specific message that leads you to believe it’s a VBIOS issue: [ 1209.300020] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
There is no other workaround. FSP errors listed above are likely due to unknown commands from your outdated VBIOS.
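To confirm whether a host is hitting the same FSP failure, the kernel log can be scanned for the message quoted above. The following is only a sketch: the regular expression is based on the single log line shown in this thread, and the function name `find_fsp_failures` is hypothetical. Other driver versions may word the message differently.

```python
import re

# Matches kernel log lines of the form "[ 1209.300020] NVRM: <message>".
NVRM_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM:\s+(?P<msg>.*)$")


def find_fsp_failures(dmesg_text: str) -> list[tuple[float, str]]:
    """Return (timestamp, message) pairs for NVRM lines reporting an FSP boot failure."""
    hits = []
    for line in dmesg_text.splitlines():
        match = NVRM_LINE.match(line.strip())
        if match and "FSP boot cmds failed" in match.group("msg"):
            hits.append((float(match.group("ts")), match.group("msg")))
    return hits


# Sample line copied from the dmesg output quoted in this thread.
sample = (
    "[ 1209.300020] NVRM: kfspSendBootCommands_GH100: "
    "FSP boot cmds failed. RM cannot boot."
)

for ts, msg in find_fsp_failures(sample):
    print(f"{ts:.6f}s: {msg}")
```

On a live system the input text would come from `dmesg`, which typically requires root privileges.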
If there is a method to solve the problem without updating the BIOS, and I choose to use that method, will I encounter more issues later?
Unfortunately, no. You must update your VBIOS to enable confidential computing modes on H100.
I gave it a shot with both HPE and DELL tools to update the VBIOS on my H100 GPU, but it seems like both times I hit a snag—couldn’t get the BIOS info from the GPU.
# cat gpu.log
Current Information for Nvidia Devices:
Number of Nvidia GPUs present on System : 1
Device Information:
98:00.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
NVIDIA Firmware Information for all Devices:
# cat console.log
HPE Firmware- Video BIOS ( NVIDIA) flashing Utility Version 1.0 - output files saved in *.log
Gathering System information(Check system.log).......................Complete!
NVIDIA Firmware Information for all Devices:
Gathering Nvidia Device information(Check gpu.log).......................Complete!
Firmware not found in component for for index 0
Utility is not built for this device. This is for Graphics Device with Board ID 0x03B1 series of cards.
I have a question: Why would the H100 GPU have a VBIOS version that doesn’t support confidential computing features? Were the confidential computing features added later to the H100?
Why does an outdated VBIOS lead to the “No devices were found” error?
I’m trying to understand the connection between the VBIOS and the “No devices were found” error. Can anyone explain this for me, as well as the detailed reasons for this error occurring (the dmesg information is available above)? Is it possible that it’s not a VBIOS issue?
Besides reaching out to the supplier, how can we fix this problem by ourselves?
Thank you very much!
Hello, have you solved this problem? I have also been trying to make Intel TDX and H100 cc mode work together recently, and my VBIOS version is 96.00.30.00.01. In addition, the nvidia-driver-550-server-open installed according to the deployment manual has been reporting an error after the restart: “The nvidia gpu installed in this system is not supported by the nvidia 550.127.08 driver release.” Is this problem caused by the VBIOS version?
No, I haven’t fixed the VBIOS issue yet. According to the information in the Nvidia manual, your VBIOS version is too low.
“The nvidia gpu installed in this system is not supported by the nvidia 550.127.08 driver release.”
I can’t say for sure if that’s the reason. You could start a new thread so more people can join in on discussing this issue. In my debugging, when the host does not have the CC state enabled, nvidia-driver-535 installs just fine. But once the host enables the CC state, nvidia-driver-550 can no longer be installed. As far as I know, nvidia-driver-550 and above introduced the confidential compute module.
I’m guessing nvidia-driver-550 detected that your H100 doesn’t meet all the requirements to enable CC, which is causing this problem. If there’s more error info, we can talk about it further. Hope this helps you out.