Can anyone help with these boot errors? The system is working, but it looks like certain kernel subsystems, e.g. Energy Aware Scheduling (EAS), are not yet fully compatible with the GB10’s heterogeneous CPU topology. Output from journalctl -b -p err:
kernel: platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
kernel: acpi NVDA8900:00: platform device creation failed: -16
kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: processor cpu0: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu1: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu2: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu3: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu4: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu10: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu11: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu12: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu13: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu14: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu15: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu16: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu17: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu18: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu19: EM: CPUs of 15-19 must have the same capacity
@charith987 I haven’t found a resolution for these boot errors. I haven’t noticed specific performance issues, but I imagine CPU core selection may be suboptimal if EAS is disabled.
The reply from @aniculescu is positive so hopefully it’s fixed in the next update. Otherwise I was planning to email the mailing list.
I am still seeing these errors on a Dell GB10. Curiously, when the system was first configured, these errors were not present. I suspect it is a bug in the default GRUB command line.
My theory, until we hear more from @aniculescu or NVIDIA, is that this is making things tricky for the OS scheduler.
I just updated my Spark to the latest software and journalctl -p err still reports:
Jan 11 15:59:45 kernel: processor cpu0: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu1: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu2: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu3: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu4: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu10: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu11: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu12: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu13: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu14: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu15: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu16: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu17: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu18: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu19: EM: CPUs of 15-19 must have the same capacity
and the CPU capacity report is the same as I reported earlier in #2.
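For anyone comparing their own output, the repeated EM lines can be boiled down to one entry per CPU group. This is only a rough sketch: the function names `expand_cpu_list` and `summarize_em_warnings` are made up for illustration, and the regex assumes the message text matches the lines quoted above exactly.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   kernel: processor cpu0: EM: CPUs of 0-4,10-14 must have the same capacity
EM_LINE = re.compile(
    r"processor cpu(?P<cpu>\d+): EM: CPUs of (?P<cpus>[\d,\-]+) "
    r"must have the same capacity"
)


def expand_cpu_list(spec):
    """Expand a kernel-style CPU list such as '0-4,10-14' into sorted ints."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A single CPU like '7' has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def summarize_em_warnings(log_text):
    """Map each reported CPU list to the sorted CPU ids that logged it."""
    reporters = defaultdict(set)
    for match in EM_LINE.finditer(log_text):
        reporters[match["cpus"]].add(int(match["cpu"]))
    return {spec: sorted(cpus) for spec, cpus in reporters.items()}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): journalctl -b -p err | python3 em_summary.py
    text = "" if sys.stdin.isatty() else sys.stdin.read()
    for spec, cpus in summarize_em_warnings(text).items():
        print(f"group {spec}: reported by CPUs {cpus}, covers {expand_cpu_list(spec)}")
```

If the set of reporting CPUs matches the expanded list, every CPU in that group failed the same check, which is what the log above shows for both 0-4,10-14 and 15-19.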
@j0n those are kernel warnings, not errors! And while they’re annoying, there’s nothing to worry about. It looks like a firmware bug, and some ACPI tables need updating. The Spark is also using an older kernel, the 6.14.x branch, while the latest stable Linux kernel is 6.18.x. There are a lot of changes between these branches.
Check the capacities of CPUs 0-4 and 10-14 to see the mismatch. Same with CPUs 15-19.
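One way to do that check is to read each CPU's `cpu_capacity` file under `/sys/devices/system/cpu/` and group the CPUs by value. A minimal sketch, assuming the kernel exposes that file for every CPU (it normally does on arm64); `read_capacities` and `group_by_capacity` are names I made up:

```python
import re
from collections import defaultdict
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_capacities(base=SYSFS_CPU):
    """Return {cpu_id: capacity} for every cpuN dir that has a cpu_capacity file."""
    caps = {}
    for entry in Path(base).glob("cpu[0-9]*"):
        match = re.fullmatch(r"cpu(\d+)", entry.name)
        cap_file = entry / "cpu_capacity"
        if match and cap_file.is_file():
            caps[int(match.group(1))] = int(cap_file.read_text().strip())
    return caps


def group_by_capacity(caps):
    """Return {capacity: [cpu ids]} with the CPU ids in ascending order."""
    groups = defaultdict(list)
    for cpu, cap in sorted(caps.items()):
        groups[cap].append(cpu)
    return dict(groups)


if __name__ == "__main__":
    groups = group_by_capacity(read_capacities())
    if not groups:
        print("no cpu_capacity files found under", SYSFS_CPU)
    for cap, cpus in sorted(groups.items()):
        print(f"capacity {cap}: CPUs {cpus}")
```

If the CPUs named in one EM message (say 0-4 and 10-14) end up spread over more than one capacity value, that is the mismatch the kernel is complaining about.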
From my assistant, Ms. Gemini:
While these messages look like errors, they are often treated as warnings. Here is what might happen:
- Suboptimal Scheduling: The Energy Aware Scheduler (EAS) might be disabled for these clusters. This means the OS won't be as efficient at picking the "cheapest" core for a task, which can lead to higher power consumption or slightly lower battery life.
- Thermal Management: The kernel might struggle to accurately predict how much heat a task will generate on those specific cores.
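Whether EAS is actually switched off can be probed rather than guessed, via the `kernel.sched_energy_aware` sysctl. A hedged sketch: the helper name `eas_status` is hypothetical, the file only exists when the kernel was built with energy model support, and a value of 1 only means EAS is permitted, not that the scheduler is using it right now. As far as I know, some newer kernels return an empty read when the platform cannot use EAS, so that case is treated as "unavailable" here.

```python
from pathlib import Path

SCHED_ENERGY_AWARE = Path("/proc/sys/kernel/sched_energy_aware")


def eas_status(path=SCHED_ENERGY_AWARE):
    """Return 'enabled', 'disabled', or 'unavailable' for the EAS sysctl."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable: kernel built without the knob,
        # or not a Linux system at all.
        return "unavailable"
    if value == "1":
        return "enabled"
    if value == "0":
        return "disabled"
    # Empty or unexpected content: assume the platform cannot use EAS.
    return "unavailable"


if __name__ == "__main__":
    print("sched_energy_aware:", eas_status())
```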
Hopefully the Green Team will take notice and address the issue in a future firmware or kernel release.