ACPI and EM errors in the kernel log during boot (capacity mismatch)

Can anyone help with these boot errors? The system is working, but it looks like certain kernel subsystems, e.g. Energy Aware Scheduling, are not yet fully compatible with the GB10’s heterogeneous CPU topology. Output from journalctl -b -p err:

kernel: platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
kernel: acpi NVDA8900:00: platform device creation failed: -16
kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: xhci_endpoint_init rsv 0x801
kernel: processor cpu0: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu1: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu2: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu3: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu4: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu10: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu11: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu12: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu13: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu14: EM: CPUs of 0-4,10-14 must have the same capacity
kernel: processor cpu15: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu16: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu17: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu18: EM: CPUs of 15-19 must have the same capacity
kernel: processor cpu19: EM: CPUs of 15-19 must have the same capacity

nvidia-bug-report.log.gz (547.7 KB)

$ for c in /sys/devices/system/cpu/cpu*/cpu_capacity; do cpu=$(basename "$(dirname "$c")"); printf "%s: %s\n" "$cpu" "$(cat "$c")"; done | sort -n -k 1.4
cpu0: 718
cpu1: 718
cpu2: 718
cpu3: 718
cpu4: 718
cpu5: 997
cpu6: 997
cpu7: 997
cpu8: 997
cpu9: 997
cpu10: 731
cpu11: 731
cpu12: 731
cpu13: 731
cpu14: 731
cpu15: 1017
cpu16: 1017
cpu17: 1017
cpu18: 1017
cpu19: 1024

Hi, is this resolved? I am having similar issues.

Many of these messages will be addressed in the next update. If you see any performance issues, please reach back out.


@charith987 I haven’t found a resolution for these boot errors. I haven’t noticed specific performance issues, but I imagine CPU core selection may be suboptimal if EAS is disabled.

The reply from @aniculescu is positive so hopefully it’s fixed in the next update. Otherwise I was planning to email the mailing list.

I got the same errors during boot. Is there any information on when the next update will be?

I am still seeing these errors on a Dell GB10. Curiously, when the system was first configured, these errors were not present. I suspect it is a bug in the grub default command line.

If you study this diagram you can see there is a lot of CPU heterogeneity going on:

Namely, the two CPU clusters:

  • have a mixture of A725 and X925 cores;
  • have different L3 cache sizes; and
  • have different clock speeds for the X925

and as also suggested here:

my theory until we hear more from @aniculescu or NVIDIA is that this is making things tricky for the OS scheduler.

I just updated my Spark to the latest software and journalctl -p err still reports:

Jan 11 15:59:45 kernel: processor cpu0: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu1: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu2: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu3: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu4: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu10: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu11: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu12: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu13: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu14: EM: CPUs of 0-4,10-14 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu15: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu16: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu17: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu18: EM: CPUs of 15-19 must have the same capacity
Jan 11 15:59:45 kernel: processor cpu19: EM: CPUs of 15-19 must have the same capacity

and the CPU capacity report is the same as I reported earlier in #2.
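
Because every affected CPU logs its own copy of the message, the output can be condensed to the distinct CPU sets it names. A rough sketch, assuming the message format shown above; `summarize_em` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and count the
# "EM: ... must have the same capacity" messages per CPU set.
summarize_em() {
    sed -n 's/.*EM: CPUs of \([0-9,-]*\) must have the same capacity.*/\1/p' |
        sort | uniq -c |
        awk '{ print $2 ": " $1 " message(s)" }'
}

# Feed it the current boot's error-level kernel messages, if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -p err --no-pager | summarize_em
fi
```

For the log above, this reduces the fifteen lines to two CPU sets: 0-4,10-14 (10 messages) and 15-19 (5 messages).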

@j0n those are kernel warnings, not errors! They’re annoying, but there’s nothing to worry about. It looks like a firmware bug, and some ACPI tables need updating. The Spark is also using an older kernel, the 6.14.x branch, while the latest stable Linux kernel is 6.18.x. There are a lot of changes between these branches.

Output of processor topology from lscpu -e:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 2808.0000  338.0000 1924.0000
  1    0      0    1 1:1:1:0          yes 2808.0000  338.0000 1976.0000
  2    0      0    2 2:2:2:0          yes 2808.0000  338.0000 2028.0000
  3    0      0    3 3:3:3:0          yes 2808.0000  338.0000 1976.0000
  4    0      0    4 4:4:4:0          yes 2808.0000  338.0000 2002.0000
  5    0      0    0 5:5:5:0          yes 3900.0000 1378.0000 4030.0000
  6    0      0    1 6:6:6:0          yes 3900.0000 1378.0000 2964.0000
  7    0      0    2 7:7:7:0          yes 3900.0000 1378.0000 2912.0000
  8    0      0    3 8:8:8:0          yes 3900.0000 1378.0000 3328.0000
  9    0      0    4 9:9:9:0          yes 3900.0000 1378.0000 7254.0000
 10    0      0    5 10:10:10:1       yes 2860.0000  338.0000 2314.0000
 11    0      0    6 11:11:11:1       yes 2860.0000  338.0000 2158.0000
 12    0      0    7 12:12:12:1       yes 2860.0000  338.0000 2080.0000
 13    0      0    8 13:13:13:1       yes 2860.0000  338.0000 2236.0000
 14    0      0    9 14:14:14:1       yes 2860.0000  338.0000 2210.0000
 15    0      0    5 15:15:15:1       yes 3978.0000 1378.0000 3068.0000
 16    0      0    6 16:16:16:1       yes 3978.0000 1378.0000 3094.0000
 17    0      0    7 17:17:17:1       yes 3978.0000 1378.0000 3120.0000
 18    0      0    8 18:18:18:1       yes 3978.0000 1378.0000 3588.0000
 19    0      0    9 19:19:19:1       yes 4004.0000 1378.0000 3614.0000

Compare CPUs 0-4 with CPUs 10-14 to see the mismatch (MAXMHZ 2808 vs. 2860). The same applies within CPUs 15-19, where CPU 19 differs from the rest (4004 vs. 3978).
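
The same comparison can be made on the capacity values instead of the clock speeds. The sketch below takes a CPU list in the form used by the EM message and prints the distinct `cpu_capacity` values found in it. `expand_cpulist` and `distinct_capacities` are hypothetical helper names, and the optional second argument (a sysfs root) is an assumption added for testing.

```shell
#!/bin/sh
# Hypothetical helper: expand "0-4,10-14" into one CPU number per line.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while read -r part; do
        case "$part" in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # a range such as 0-4
            *)   printf '%s\n' "$part" ;;           # a single CPU
        esac
    done
}

# Hypothetical helper: print the distinct capacities within a CPU list.
# More than one value means the set is not uniform.
distinct_capacities() {
    root="${2:-/sys/devices/system/cpu}"
    expand_cpulist "$1" | while read -r n; do
        cat "$root/cpu$n/cpu_capacity" 2>/dev/null
    done | sort -n -u | paste -sd' ' -
}

# The two CPU sets named in the EM messages.
for cpus in 0-4,10-14 15-19; do
    printf '%s -> %s\n' "$cpus" "$(distinct_capacities "$cpus")"
done
```

Going by the capacities reported earlier in the thread, the first set would show 718 and 731 and the second 1017 and 1024, which matches what the EM message complains about.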

From my assistant, Ms. Gemini:

While these messages look like errors, they are often treated as warnings. Here is what might happen:

- Suboptimal Scheduling: The Energy Aware Scheduler (EAS) might be disabled for these clusters. This means the OS won't be as efficient at picking the "cheapest" core for a task, which can lead to higher power consumption or slightly lower battery life.

- Thermal Management: The kernel might struggle to accurately predict how much heat a task will generate on those specific cores.
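
Whether EAS is actually switched off could be checked along these lines. This is a sketch under the assumption that the running kernel exposes the `sched_energy_aware` sysctl; that depends on the kernel configuration, and what the file reports on platforms where EAS is not possible may differ between kernel versions. `eas_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the sched_energy_aware sysctl.
# $1 (optional) overrides the file path, which is handy for testing.
eas_status() {
    f="${1:-/proc/sys/kernel/sched_energy_aware}"
    if [ ! -e "$f" ]; then
        echo "not exposed"          # sysctl absent on this kernel
        return
    fi
    v=$(cat "$f" 2>/dev/null)
    case "$v" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;        # empty or unexpected content
    esac
}

eas_status
```

A result other than "enabled" would be consistent with the suggestion that EAS is not in use, but on its own it does not say why.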

Hopefully the Green Team will take notice and address the issue in a future firmware or kernel release.
