Hello,
I have questions regarding PSCI CPU_ON and possible RAS Uncorrectable errors.
Here is the current system I am running:
- Jetson Orin Nano Developer Kit 8GB
- MB1 (version: 1.4.0.4-t234-54845784-e89ea9bc)
- MB2 (version: 0.0.0.0-t234-54845784-22833a33)
- BL31: v2.8(release):e12e3fa93
- OP-TEE version: 4.2 (gcc version 11.3.0 (Buildroot 2022.08)) #2 Wed Jan 8 01:24:03 UTC 2025 aarch64
- Jetson UEFI firmware (version 36.4.3-gcid-38968081 built on 2025-01-08T01:18:20+00:00)
- Linux version 5.15.148-tegra (buildbrain@mobile-u64-6336-d8000) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 202) installed on an NVMe drive
The system was installed with NVIDIA SDK Manager.
I am developing a simple hypervisor that is loaded by a UEFI application
before the OS starts. The current loading flow is:
- Boot into UEFI shell.
- Run the UEFI application located on a USB drive, which jumps into the
hypervisor initialization code (still in EL2 at this point).
- After the hypervisor initialization is done, it sets up the EL1 environment and
jumps back to the UEFI loader in EL1.
- Control is now returned to the UEFI application (in EL1 at this point).
- The UEFI application exits back to the UEFI shell.
- Start the OS.
The hypervisor mainly sets up:
- MMU Stage 2 translation (1-to-1 mapping from intermediate physical address to real physical address, except for memory reserved by the hypervisor)
- Virtual GIC
- SMC call trapping (needed to handle PSCI CPU_SUSPEND and PSCI CPU_ON)
The OS boots to the login screen most of the time. However, sometimes a
RAS Uncorrectable error occurs:
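To make the SMC trapping item concrete, here is a minimal sketch of how a trapped SMC could be classified by its function ID. The function ID values are the ones defined by the Arm PSCI specification; the enum and the function name classify_smc are hypothetical and are not taken from my hypervisor source.

```c
#include <stdint.h>

/* PSCI function IDs as defined by the Arm PSCI specification. */
#define PSCI_CPU_SUSPEND_SMC32 0x84000001u
#define PSCI_CPU_SUSPEND_SMC64 0xC4000001u
#define PSCI_CPU_ON_SMC32      0x84000003u
#define PSCI_CPU_ON_SMC64      0xC4000003u

/* Hypothetical classification used only in this sketch. */
enum smc_action {
    SMC_FORWARD,             /* pass the call to firmware unchanged */
    SMC_HANDLE_CPU_SUSPEND,  /* hypervisor must intervene */
    SMC_HANDLE_CPU_ON        /* hypervisor must intervene */
};

/* Decide what to do with a trapped SMC, given the guest's x0. */
static enum smc_action classify_smc(uint64_t x0)
{
    /* The function ID is carried in the low 32 bits (w0). */
    uint32_t fid = (uint32_t)x0;

    switch (fid) {
    case PSCI_CPU_SUSPEND_SMC32:
    case PSCI_CPU_SUSPEND_SMC64:
        return SMC_HANDLE_CPU_SUSPEND;
    case PSCI_CPU_ON_SMC32:
    case PSCI_CPU_ON_SMC64:
        return SMC_HANDLE_CPU_ON;
    default:
        return SMC_FORWARD;
    }
}
```

Everything that is not CPU_SUSPEND or CPU_ON is forwarded untouched in this sketch; a real handler may need to treat more calls specially.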
...
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services...
***RRORR OR: e***o* r*a****1*s*ndr*m*=**82*0001*
ERROR: RAS Uncorrectable Error in IOB, base=0xe010000:
ERROR: Status = 0xe4000612
ERROR: SERR = Error response from slave: 0x12
ERROR: IERR = CBB Interface Error: 0x6
ERROR: MISC0 = 0xc04a0040
ERROR: MISC1 = 0x24c1844000000000
ERROR: MISC2 = 0x0
ERROR: MISC3 = 0x0
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: **************************************
ERROR: RAS Uncorrectable Error in ACI, base=0xe01a000:
ERROR: Status = 0xe8000904
ERROR: SERR = Assertion failure: 0x4
ERROR: IERR = FillWrite Error: 0x9
ERROR: Overflow (there may be more errors) - Uncorrectable
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: Excepiio off core
syndrome=0x82000014
ERROR: **************************************
ERROR: RAS Uncorrectable Error in IOB, base=0xe010000:
ERROR: Status = 0xe4000612
ERROR: SERR = Error response from slave: 0x12
ERROR: IERR = CBB Interface Error: 0x6
ERROR: MISC0 = 0xc0424040
ERROR: MISC1 = 0x741844000000000
ERROR: MISC2 = 0x0
ERROR: MISC3 = 0x0
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: **************************************
ERROR: RAS Uncorrectable Error in ACI, base=0xe01a000:
ERROR: Status = 0xe8000904
ERROR: SERR = Assertion failure: 0x4
ERROR: IERR = FillWrite Error: 0x9
ERROR: Overflow (there may be more errors) - Uncorrectable
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: Powering off core
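For anyone reading the log, the Status words can be split into fields with a small helper like the one below. This is only a sketch: it assumes the Status value follows the ERR<n>STATUS layout of the Arm RAS Extension (AV bit 31, V bit 30, UE bit 29, OF bit 27, MV bit 26, IERR bits 15:8, SERR bits 7:0). That assumption agrees with the SERR and IERR values the firmware prints, but I have not confirmed it against NVIDIA documentation. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of one error record status word (assumed layout). */
struct ras_status {
    bool addr_valid;   /* AV, bit 31: ADDR register holds a valid value */
    bool valid;        /* V,  bit 30: the record itself is valid */
    bool uncorrected;  /* UE, bit 29: uncorrected error */
    bool overflow;     /* OF, bit 27: more errors than could be recorded */
    bool misc_valid;   /* MV, bit 26: MISC registers hold valid values */
    uint8_t ierr;      /* IERR, bits 15:8: implementation-defined code */
    uint8_t serr;      /* SERR, bits 7:0: architectural error code */
};

/* Split a 32-bit status word into its fields. */
static struct ras_status decode_ras_status(uint32_t status)
{
    struct ras_status s;

    s.addr_valid  = ((status >> 31) & 1u) != 0;
    s.valid       = ((status >> 30) & 1u) != 0;
    s.uncorrected = ((status >> 29) & 1u) != 0;
    s.overflow    = ((status >> 27) & 1u) != 0;
    s.misc_valid  = ((status >> 26) & 1u) != 0;
    s.ierr        = (uint8_t)((status >> 8) & 0xFFu);
    s.serr        = (uint8_t)(status & 0xFFu);
    return s;
}
```

Under that layout, 0xe4000612 (IOB) decodes to an uncorrected error with valid MISC registers, IERR 0x6 and SERR 0x12, and 0xe8000904 (ACI) decodes to an uncorrected error with the overflow bit set, IERR 0x9 and SERR 0x4, which matches what the firmware prints.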
I tried to track down where the problem occurs and found that it seems to be
triggered by issuing the PSCI CPU_ON call.
Since the hypervisor traps SMC calls from the guest, it is responsible for
issuing SMC calls on behalf of the guest. The flow for issuing PSCI CPU_ON is
as follows:
- Trap the SMC call from the guest.
- The hypervisor arranges its own PSCI CPU_ON call with a reference to the
guest’s physical entry address.
- The hypervisor issues the PSCI CPU_ON call.
- The PSCI CPU_ON call returns, and the hypervisor finally returns to the guest.
- When the secondary CPU core starts, the hypervisor initializes itself on that core
and jumps to the guest’s physical entry address in EL1.
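The flow above can be sketched roughly as follows. This is a simplified illustration, not my actual code: the struct layouts, forward_cpu_on, and fake_smc are hypothetical names, and fake_smc only records its arguments so that the sketch can run in user space. The argument order (x1 = target CPU, x2 = entry point address, x3 = context ID) is the one defined by the PSCI specification for CPU_ON.

```c
#include <stdint.h>

#define PSCI_CPU_ON_SMC64 0xC4000003u
#define PSCI_SUCCESS      0

/* Guest registers saved at the SMC trap (hypothetical layout). */
struct guest_regs { uint64_t x0, x1, x2, x3; };

/* What the hypervisor remembers for the core being started. */
struct cpu_boot_ctx {
    uint64_t guest_entry;       /* guest's physical entry address */
    uint64_t guest_context_id;  /* value the guest expects in x0 */
};

/* Signature of the routine that performs the real SMC. */
typedef int64_t (*smc_fn)(uint64_t fid, uint64_t target,
                          uint64_t entry, uint64_t context);

/* Stand-in for the real SMC: records the arguments and reports success. */
static struct { uint64_t fid, target, entry, context; } last_call;

static int64_t fake_smc(uint64_t fid, uint64_t target,
                        uint64_t entry, uint64_t context)
{
    last_call.fid = fid;
    last_call.target = target;
    last_call.entry = entry;
    last_call.context = context;
    return PSCI_SUCCESS;
}

/*
 * Replace the guest's entry point with the hypervisor's own entry,
 * keep the guest's entry for later, and hand the result back in x0.
 */
static void forward_cpu_on(struct guest_regs *r, struct cpu_boot_ctx *ctx,
                           uint64_t hyp_entry_phys, smc_fn smc)
{
    int64_t ret;

    /* Remember where the guest wanted the new core to start. */
    ctx->guest_entry = r->x2;
    ctx->guest_context_id = r->x3;

    /* Real code would pass a physical address for the context here. */
    ret = smc(r->x0, r->x1, hyp_entry_phys, (uint64_t)(uintptr_t)ctx);

    /* Return the firmware's result to the guest. */
    r->x0 = (uint64_t)ret;
}
```

In the real hypervisor, the role of fake_smc is played by the assembly routine that executes the SMC instruction, and the secondary core later reads the saved entry address before dropping to EL1.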
I added the following patch to the hypervisor code for debugging. Note that
printf() writes its output to the UART serial console.
diff --git a/core/aarch64/smc.c b/core/aarch64/smc.c
--- a/core/aarch64/smc.c
+++ b/core/aarch64/smc.c
@@ -194,6 +194,8 @@ handle_psci_cpu_on (union exception_save
g->pa_base = vmm_mem_start_phys ();
g->va_base = vmm_mem_start_virt ();
+ printf ("Calling CPU_ON\n");
+
/* Check for error from SMC call */
error = smc_asm_psci_call (r->reg.x0, r->reg.x1,
sym_to_phys (entry_cpu_on),
@@ -205,6 +207,8 @@ handle_psci_cpu_on (union exception_save
free (stack);
}
+ printf ("Done Calling CPU_ON\n");
+
/* Return error to the guest */
r->reg.x0 = error;
}
In successful cases, the serial output looks like the following:
...
Starting a virtual machine...
Processor 0 entering EL1
Shell> fs3:EFI\BOOT\BOOTAA64.efi
L4TLauncher: Attempting Direct Boot
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services...
Calling CPU_ON # Output from the hypervisor
Done Calling CPU_ON # Output from the hypervisor
Processor 100 entering EL1 # Output from the hypervisor
Calling CPU_ON # Output from the hypervisor
Done Calling CPU_ON # Output from the hypervisor
Processor 200 entering EL1 # Output from the hypervisor
Calling CPU_ON # Output from the hypervisor
Done Calling CPU_ON # Output from the hypervisor
Processor 300 entering EL1 # Output from the hypervisor
Calling CPU_ON # Output from the hypervisor
Done Calling CPU_ON # Output from the hypervisor
Processor 10200 entering EL1 # Output from the hypervisor
Calling CPU_ON # Output from the hypervisor
Done Calling CPU_ON # Output from the hypervisor
Processor 10300 entering EL1 # Output from the hypervisor
��debugfs initialized
��I/TC: Reserved shared memory is disabled
I/TC: Dynamic shared memory is enabled
I/TC: Normal World virtualization support is disabled
I/TC: Asynchronous notifications are disabled
��[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd421]
[ 0.000000] Linux version 5.15.148-tegra (buildbrain@mobile-u64-6336-d8000) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 202)
[ 0.000000] Machine model: NVIDIA Jetson Orin Nano Engineering Reference Developer Kit Super
[ 0.000000] efi: EFI v2.70 by EDK II
[ 0.000000] efi: RTPROP=0x26d82f198 TPMFinalLog=0x25e3f0000 SMBIOS=0xffff0000 SMBIOS 3.0=0x26d220000 MEMATTR=0x266cc6018 ESRT=0x267
[ 0.000000] random: crng init done
...
When the error occurs, it looks like the following:
Starting a virtual machine...
Processor 0 entering EL1
Shell> fs3:EFI\BOOT\BOOTAA64.efi
L4TLauncher: Attempting Direct Boot
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services...
Calling CPU_ON # Output from the hypervisor
Done C��ERRORRRORE ce**i*n**e*****1 *y*****e**x******1***** # Output from the hypervisor is interrupted
ERROR: RAS Uncorrectable Error in IOB, base=0xe010000:
ERROR: Status = 0xe4000612
ERROR: SERR = Error response from slave: 0x12
ERROR: IERR = CBB Interface Error: 0x6
ERROR: MISC0 = 0xc05e0040
ERROR: MISC1 = 0x2c1844000000000
ERROR: MISC2 = 0x0
ERROR: MISC3 = 0x0
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: **************************************
ERROR: RAS Uncorrectable Error in ACI, base=0xe01a000:
ERROR: Status = 0xe8000904
ERROR: SERR = Assertion failure: 0x4
ERROR: IERR = FillWrite Error: 0x9
ERROR: Overflow (there may be more errors) - Uncorrectable
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: Exwertng fe sone1 syndrome=0x82000014
ERROR: **************************************
ERROR: RAS Uncorrectable Error in IOB, base=0xe010000:
ERROR: Status = 0xe4000612
ERROR: SERR = Error response from slave: 0x12
ERROR: IERR = CBB Interface Error: 0x6
ERROR: MISC0 = 0xc052c040
ERROR: MISC1 = 0x2141844000000000
ERROR: MISC2 = 0x0
ERROR: MISC3 = 0x0
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: **************************************
ERROR: RAS Uncorrectable Error in ACI, base=0xe01a000:
ERROR: Status = 0xe8000904
ERROR: SERR = Assertion failure: 0x4
ERROR: IERR = FillWrite Error: 0x9
ERROR: Overflow (there may be more errors) - Uncorrectable
ERROR: ADDR = 0x8000000000000000
ERROR: **************************************
ERROR: sdei_dispatch_event returned -1
ERROR: Powering off core
As you can see, the error occurs almost immediately after returning from the
PSCI CPU_ON call.
The error can usually be reproduced within roughly 7 to 15 reboots. It seems to me
that the error is quite random.
I don’t expect this kind of error from a PSCI CPU_ON call, and I cannot find any
documentation explaining the relationship between PSCI CPU_ON and
RAS Uncorrectable errors. My questions are:
- Under what conditions can this situation occur?
- Is there a good way to determine whether the problem is in the hypervisor implementation or in the firmware?
- Do you have any suggestions for a workaround?
If you need more information, please don’t hesitate to ask me.
Best Regards
Ake Koomsin