A more reliable way to update BMC firmware ERoT

Hi Nvidia:
We are building our system on the IGX Orin board kit with IGX OS 1.0.3.
Our first step is to update the BMC firmware. But, as the title says, the process is pretty unreliable.

We were using curl to update the BMC firmware, as we did before. We first use curl to log in and start the firmware update, then use curl to check whether the update task has completed.
In our experience, 30~40 minutes is long enough for the update once it has started. That is, after the task’s percent complete reaches 100, we still give it some extra time to settle. Even so, it is still not a reliable way to update the firmware.
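
As a rough sketch of the polling step described above, the fixed 30~40 minute wait could be replaced by a loop that watches the task's TaskState and stops once it reaches a terminal value (Completed, Exception, Killed, or Cancelled in the Redfish Task schema). The function names, the 30-second interval, and the 120-poll limit are illustrative assumptions, not part of any NVIDIA tooling; $bmc and $token are assumed to be set as in the curl commands in this thread.

```shell
#!/bin/sh
# Sketch only: poll a Redfish task until it reaches a terminal state.
# Assumes $bmc (BMC address) and $token (X-Auth-Token) are already set.

# Return 0 if the given TaskState is terminal in the Redfish Task schema.
is_terminal_state() {
  case "$1" in
    Completed|Exception|Killed|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Read task JSON on stdin and print the value of its "TaskState" key.
# Plain sed keeps the sketch dependency-free; a JSON parser would be sturdier.
extract_task_state() {
  sed -n 's/.*"TaskState": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll the task URI until it is terminal or the poll limit is reached.
# Prints the final state; returns 0 for Completed, 1 for other terminal
# states, and 2 on timeout.
wait_for_task() {
  task_uri=$1            # e.g. /redfish/v1/TaskService/Tasks/1
  interval=${2:-30}      # seconds between polls (arbitrary choice)
  max_polls=${3:-120}    # give up after interval * max_polls seconds
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # An empty state (BMC briefly unreachable) is treated as "not done yet".
    state=$(curl -sk -H "X-Auth-Token: $token" "https://${bmc}${task_uri}" \
      | extract_task_state)
    if is_terminal_state "$state"; then
      echo "$state"
      if [ "$state" = "Completed" ]; then return 0; else return 1; fi
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "Timeout"
  return 2
}
```

For example, `wait_for_task /redfish/v1/TaskService/Tasks/1` prints the final state and returns 0 only for Completed. Checking TaskState rather than the percent complete matters because a task can report 100 percent and still end in Exception, as the task record in this thread shows.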

Here are my questions:

  1. How can we flash an older BMC firmware? We tried our usual method to downgrade the BMC firmware to 23 (the IGX SW 1.0 DP version), but the task ends in the Exception state instead of completing.
    Using curl to update firmware:
curl -k -H "X-Auth-Token:$token" -H "Content-Type: application/octet-stream" -X POST -T `pwd`/cec1736-apfw-11062023.fwpkg https://${bmc}/redfish/v1/UpdateService
{
  "@odata.id": "/redfish/v1/TaskService/Tasks/1",
  "@odata.type": "#Task.v1_4_3.Task",
  "Id": "1",
  "TaskState": "Running",
  "TaskStatus": "OK"
}

Task state:


{
  "@odata.id": "/redfish/v1/TaskService/Tasks/1",
  "@odata.type": "#Task.v1_4_3.Task",
  "EndTime": "2024-08-05T06:38:17+00:00",
  "Id": "1",
  "Messages": [
    {
      "@odata.type": "#Message.v1_0_0.Message",
      "Message": "The task with id 1 has started.",
      "MessageArgs": [
        "1"
      ],
      "MessageId": "TaskEvent.1.0.1.TaskStarted",
      "Resolution": "None.",
      "Severity": "OK"
    },
    {
      "@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
      "Message": "Transfer of image '0.0' to '' failed.",
      "MessageArgs": [
        "0.0",
        ""
      ],
      "MessageId": "Update.1.0.TransferFailed",
      "Resolution": "Debug Token Service is not ready, retry the firmware update operation after the management controller is ready. If the issue still persists reset the baseboard.",
      "Severity": "Critical"
    },
    {
      "@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
      "Message": "The target device 'BMC_FW_AST2600_0' will be updated with image 'cec1736ApFw-09022024'.",
      "MessageArgs": [
        "BMC_FW_AST2600_0",
        "cec1736ApFw-09022024"
      ],
      "MessageId": "Update.1.0.TargetDetermined",
      "Resolution": "None.",
      "Severity": "OK"
    },
    {
      "@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
      "Message": "Image 'cec1736ApFw-09022024' is being transferred to 'BMC_FW_AST2600_0'.",
      "MessageArgs": [
        "cec1736ApFw-09022024",
        "BMC_FW_AST2600_0"
      ],
      "MessageId": "Update.1.0.TransferringToComponent",
      "Resolution": "None.",
      "Severity": "OK"
    },
    {
      "@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
      "Message": "Verification of image 'cec1736ApFw-09022024' at 'BMC_FW_AST2600_0' failed.",
      "MessageArgs": [
        "cec1736ApFw-09022024",
        "BMC_FW_AST2600_0"
      ],
      "MessageId": "Update.1.0.VerificationFailed",
      "Resolution": "None.",
      "Severity": "Critical"
    },
    {
      "@odata.type": "#Message.v1_0_0.Message",
      "Message": "The task with id 1 has changed to progress 100 percent complete.",
      "MessageArgs": [
        "1",
        "100"
      ],
      "MessageId": "TaskEvent.1.0.1.TaskProgressChanged",
      "Resolution": "None.",
      "Severity": "OK"
    },
    {
      "@odata.type": "#Message.v1_0_0.Message",
      "Message": "The task with id 1 has been aborted.",
      "MessageArgs": [
        "1"
      ],
      "MessageId": "TaskEvent.1.0.1.TaskAborted",
      "Resolution": "None.",
      "Severity": "Critical"
    },
    {
      "@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
      "Message": "The resource property 'BMC_FW_AST2600_0' has detected errors of type 'SKU mismatch'.",
      "MessageArgs": [
        "BMC_FW_AST2600_0",
        "SKU mismatch"
      ],
      "MessageId": "ResourceEvent.1.0.ResourceErrorsDetected",
      "Resolution": "Verify the contents of the FW package",
      "Severity": "Critical"
    }
  ],
  "Name": "Task 1",
  "Payload": {
    "HttpHeaders": [
      "Host: 192.168.1.110",
      "User-Agent: curl/7.81.0",
      "Accept: */*",
      "Content-Length: 67105977"
    ],
    "HttpOperation": "POST",
    "JsonBody": "null",
    "TargetUri": "/redfish/v1/UpdateService"
  },
  "PercentComplete": 100,
  "StartTime": "2024-08-05T06:38:17+00:00",
  "TaskMonitor": "/redfish/v1/TaskService/Tasks/1/Monitor",
  "TaskState": "Exception",
  "TaskStatus": "Critical"
}
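
To pull the failures out of a long task record like the one above, a small filter can list only the entries whose Severity is Critical. This is a sketch that relies on the pretty-printed, one-key-per-line layout shown above, with Severity appearing after Message and MessageId in each entry; `critical_messages` is a hypothetical helper name, and a JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Sketch only: print "MessageId: Message" for each Critical task message.
# Assumes pretty-printed JSON on stdin with one key per line, and that
# message text contains no escaped double quotes.
critical_messages() {
  awk -F'"' '
    $2 == "Message"   { msg = $4 }   # remember the latest message text
    $2 == "MessageId" { id = $4 }    # remember the latest message id
    $2 == "Severity" && $4 == "Critical" { print id ": " msg }
  '
}
```

It could be fed directly from the task query, for example `curl -sk -H "X-Auth-Token: $token" "https://${bmc}/redfish/v1/TaskService/Tasks/1" | critical_messages`. On the record above this would list the TransferFailed, VerificationFailed, TaskAborted, and ResourceErrorsDetected entries.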

We want to downgrade the BMC firmware and then update it again so that we can reproduce the issue.

  2. Non-ERoT firmware can be updated using initramfs. Why can’t the ERoT firmware be updated the same way?
    initramfs seems to be a more reliable way to update the firmware, but it only applies to Non-ERoT. What would happen if we updated the ERoT firmware using initramfs?

If more info is needed, please let us know.
If this is actually the most reliable way, please also let us know.

Many Thanks.

Hi jameskuo,

Starting with IGX OS SW 1.0, we encourage customers to use the WebUI to update the BMC firmware, for both ERoT and Non-ERoT firmware.

As I understand it, you can use either the curl method or the WebUI method to update the ERoT BMC firmware.
Do you find that the update process takes too long, and is that why you are looking for other methods?

Hi Kevin:
Thanks for the reply.

When we received the kit, we were not able to open the WebUI on the BMC, so we had to use curl to update the firmware instead.
Can we take this as a promise that the BMC module will ship with firmware that supports the WebUI in the future?

If the update method is reliable, 30 minutes is acceptable. But if the update fails, it takes another 30 minutes to try again. Considering that the failure rate is pretty high (my colleague told me that only about 1 in 4 updates succeeds on the first try; the others need two or more attempts), this is going to be an issue.

Many thanks.

Hi Kevin:
We just received a new board kit.
After changing the BMC password, and before flashing the BMC firmware, we tried to access the BMC module WebUI. It showed:

I think this is pretty strange: no login screen is shown, so we can’t enter a username or password.
Are we using it the wrong way?

Many Thanks.

I’ve checked with our internal team: the initramfs update method will break the BMC and make it inaccessible.
I’ve also confirmed that the new BMC firmware has the WebUI, and new boards have started shipping with the new firmware installed.
For more logs on the image update, you can run “journalctl -f” after logging in to the BMC.

I’ve not seen this screen. What’s your current BMC firmware?

Hi Kevin:
Thanks for the reply.

Thanks. We will never do that.

Sure thing. Once we get the kit with the new BMC firmware preinstalled, we will test it again.

It’s GraceBMC-23.05-1-rc1, which is also the pre-installed firmware.
We were trying to confirm whether this firmware supports the WebUI. It seems the answer is no, so we will just flash it.

Many Thanks.

Yes, that is the old firmware, which does not support the WebUI.