GB10 is power limited after crash

The Asus Ascent becomes power limited after an OOM crash, or after any other crash that terminates a CUDA workload. Once this happens, nvidia-smi reports only about 5 to 9 W of power draw, even after a reboot, and a temperature of 38 C. The performance of the GB10 is restored only after the device is powered off and its power brick is unplugged for a while. For comparison, in the normal state nvidia-smi reports power usage up to ~80 W and temperature up to 55 C at 96% GB10 load. A state this distinct is obviously something that can be detected.
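As a rough illustration of how the state described above could be flagged, here is a minimal Python sketch. It assumes that nvidia-smi on this machine reports the standard `power.draw`, `clocks.sm`, `pstate`, `temperature.gpu` and `utilization.gpu` query fields. The two thresholds are my own guesses based on the numbers in this post (single-digit watts when limited, ~70 to 80 W when healthy under load) and are not anything NVIDIA or ASUS documents. The function names are hypothetical.

```python
import subprocess

# Standard nvidia-smi --query-gpu properties; some may read "[N/A]" on a given platform.
QUERY_FIELDS = "power.draw,clocks.sm,pstate,temperature.gpu,utilization.gpu"

# Assumed thresholds, chosen from the figures in this post, not from any vendor spec.
MIN_UTILIZATION_PCT = 90.0    # only judge samples taken under heavy load
MAX_THROTTLED_POWER_W = 15.0  # a loaded GPU drawing less than this looks power limited


def read_gpu_sample():
    """Run nvidia-smi once and return the raw CSV line for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()[0]


def _to_float(text):
    """Convert a CSV field to float, or None when nvidia-smi prints e.g. '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sample(csv_line):
    """Split one CSV line into a dict of typed values."""
    power, clock, pstate, temp, util = [field.strip() for field in csv_line.split(",")]
    return {
        "power_w": _to_float(power),
        "clock_mhz": _to_float(clock),
        "pstate": pstate,
        "temperature_c": _to_float(temp),
        "utilization_pct": _to_float(util),
    }


def looks_power_limited(sample):
    """True when the GPU is busy but draws far less power than a healthy unit would."""
    power, util = sample["power_w"], sample["utilization_pct"]
    if power is None or util is None:
        return False  # not enough data to decide
    return util >= MIN_UTILIZATION_PCT and power < MAX_THROTTLED_POWER_W
```

To use it, start a GPU workload first and then call `looks_power_limited(parse_sample(read_gpu_sample()))`. Checking utilization as well as power matters because an idle GPU also draws only a few watts, so low power alone says nothing.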

The software is fully updated. The firmware is up to date. The machine’s power brick has been unplugged several times after the firmware updates.

This hardware promised a lot more. It delivers too little right now. It’s slow and buggy. It’s really not worth the money.

Please run the NVIDIA DGX Spark Field Diagnostics, then DM me the log bundle. You may need to RMA the unit with ASUS, pending the diagnostic results.

I’ve provided the diagnostic log bundles. One bundle is the result of a diagnostic run after a crash. The other is the result of a diagnostic run after unplugging the power brick and powering the system on again.

I haven’t heard anything from anyone at Nvidia since providing the diagnostics files as instructed. It took about 1 h 40 min to collect the diagnostics in both states: the bad state, with the GPU power limited, and the good state, after a full power cycle.

I’ve used a script posted in another thread.

hoesing/spark-gpu-throttle-check on GitHub ("Test to see if a DGX Spark (or similar GB10 device) is throttling due to possible Power Delivery issues"), at commit 6a27755b37c753ed09a99b29d4732379a4ba4a14

Test in a good state:

============================================================

Spark GPU Throttle Check

GPU state at idle:
Clock: 208 / 3003 MHz
P-state: P8
Power: 4.4 W

Warming up GPU (2.0s)…

Collecting 20 samples under load (0.5s interval)…
Threshold: 1400 MHz

    #  Clock (MHz)  Max (MHz)  PState  Power (W)
─────  ───────────  ─────────  ──────  ─────────
    1         2392       3003      P0       69.5
    2         2392       3003      P0       69.1
    3         2392       3003      P0       69.2
    4         2392       3003      P0       69.3
    5         2392       3003      P0       69.4
    6         2392       3003      P0       69.7
    7         2392       3003      P0       70.3
    8         2392       3003      P0       71.0
    9         2392       3003      P0       70.9
   10         2405       3003      P0       70.9
   11         2405       3003      P0       71.2
   12         2405       3003      P0       71.1
   13         2405       3003      P0       71.0
   14         2405       3003      P0       71.2
   15         2405       3003      P0       71.2
   16         2405       3003      P0       71.2
   17         2405       3003      P0       71.6
   18         2405       3003      P0       71.4
   19         2405       3003      P0       71.6
   20         2405       3003      P0       71.5

────────────────────────────────────────────────────────────
RESULTS
────────────────────────────────────────────────────────────
Samples: 20
Peak clock: 2405 MHz
Average clock: 2399 MHz
Avg power draw: 70.6 W
Below threshold: 0% of samples < 1400 MHz

┌──────────────────────────────────────────────────────────┐
│  PASS — GPU clocks look healthy under load.             │
│  Peak: 2405 MHz, Avg: 2399 MHz                           │
└──────────────────────────────────────────────────────────┘
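For anyone who wants to compare their own runs, the RESULTS figures in the output above can be recomputed from the per-sample table. This is only a sketch of the arithmetic in Python, not the script's actual code, which I have not copied here. The `summarize` function and its field names are hypothetical. The 1400 MHz threshold is the one printed in the output.

```python
from statistics import mean

THRESHOLD_MHZ = 1400  # the threshold printed in the script output


def summarize(samples, threshold_mhz=THRESHOLD_MHZ):
    """Summarize a list of (clock_mhz, power_w) samples taken under load."""
    clocks = [clock for clock, _ in samples]
    powers = [power for _, power in samples]
    below = sum(1 for clock in clocks if clock < threshold_mhz)
    return {
        "samples": len(samples),
        "peak_clock_mhz": max(clocks),
        "avg_clock_mhz": round(mean(clocks)),
        "avg_power_w": round(mean(powers), 1),
        "below_threshold_pct": round(100 * below / len(samples)),
    }


# The 20 (clock, power) samples from the good-state run in this post.
GOOD_RUN = [
    (2392, 69.5), (2392, 69.1), (2392, 69.2), (2392, 69.3), (2392, 69.4),
    (2392, 69.7), (2392, 70.3), (2392, 71.0), (2392, 70.9), (2405, 70.9),
    (2405, 71.2), (2405, 71.1), (2405, 71.0), (2405, 71.2), (2405, 71.2),
    (2405, 71.2), (2405, 71.6), (2405, 71.4), (2405, 71.6), (2405, 71.5),
]

print(summarize(GOOD_RUN))
```

On the good-state samples this reproduces the reported summary: peak 2405 MHz, average 2399 MHz, average power 70.6 W, and 0% of samples below the threshold.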

This GB10 doesn’t end up in a power limited state as long as it doesn’t crash due to an OOM condition. It doesn’t crash during regular use either (no OOM). I’ll keep testing and RMA if nobody at Nvidia follows up. I can’t believe how bad this hardware is. The GPU drivers are old. The hardware and the firmware appear to be buggy.

@NVES: Can you confirm that you’ve received the files?

We have received the logs and confirmed they do not show an issue with the unit. However, I have reached out to engineering and will report back when I have more information.

Thank you. I had no idea what was going on after sending the requested logs. It’s good to know that such an expensive device isn’t about to become a paperweight due to some faulty components. I’ll wait for a proper solution (firmware update, recommendation to RMA, etc.). It’s unacceptable for such expensive hardware to be this bad. It’s not a cheap $10 device bought off some random website from Asia. Nvidia doesn’t look too good to me. The hardware performs poorly as it is, even when it’s not throttled and doesn’t run into bugs.

nvidia-smi currently reports the power at 4-14 W. There was no crash this time. Performance dropped for inference. The GB10 is completely useless in this state.

Shutting down and unplugging the power brick for 2 minutes worked around the problem. This GB10 had been turned off for a while beforehand, so it appears to have booted into this broken state.

This is very difficult to work with.