The suspected PD throttling issue has been consistent enough that I wanted to create a quick tool to check for the failure state. Some noodling around with Claude resulted in:
It checks whether the GPU clock under load stays above or below a clock speed threshold (default 1400 MHz) over a number of samples (default 20, spaced about 0.5 s apart). A GPU in a good state should hit ~2400 MHz. Mine in a bad state are showing 500-850 MHz, even though the power state still reports P0.
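For anyone who wants to see the idea without pulling the repo, here is a minimal sketch of that kind of check. It is not the tool itself. It assumes `nvidia-smi` is on the PATH and returns a numeric value for `clocks.gr` on your system, and that you already have a GPU load running while it samples. The function names and the use of the median are my own choices for illustration; the defaults (1400 MHz, 20 samples, 0.5 s) are the ones described above.

```python
import subprocess
import time
from statistics import median


def read_graphics_clock_mhz(gpu_index=0):
    """Return the current graphics clock in MHz as reported by nvidia-smi.

    Assumes nvidia-smi prints a plain number for clocks.gr on this system.
    """
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=clocks.gr",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def sample_clocks(read_clock=read_graphics_clock_mhz, samples=20,
                  interval_s=0.5, sleep=time.sleep):
    """Collect `samples` clock readings, pausing `interval_s` between them."""
    readings = []
    for i in range(samples):
        readings.append(read_clock())
        if i < samples - 1:
            sleep(interval_s)
    return readings


def classify(readings, threshold_mhz=1400):
    """Return "ok" if the median reading meets the threshold, else "throttled"."""
    return "ok" if median(readings) >= threshold_mhz else "throttled"


def main():
    # Run this only while a GPU load is active, otherwise idle clocks
    # will look like throttling.
    readings = sample_clocks()
    print(f"min={min(readings)} median={median(readings)} max={max(readings)} MHz")
    print(classify(readings))
```

Call `main()` while a load is running. Using the median rather than the minimum is meant to keep a single low sample during ramp-up from flagging a healthy GPU.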
Happy to accept suggestions for improvement, here or in the issues on the GitHub repo.
Hopefully NVIDIA figures out the root cause of this issue, because it’s pretty bad: it requires physical access to the device to resolve, and there are no real indicators of the failure state other than poor performance.
No, this is inconsistent behavior that resolves itself when you disconnect the power brick from the wall. There are multiple threads on this.
Basically, the GPU seems to enter a reduced clock state that does not resolve itself by rebooting, shutting down, or even physically removing the USB-C power cable. The power needs to be removed from the power brick itself in order to return to full GPU clock. I’m still trying to figure out a repro case to confirm this, but there are multiple reports of this in the forums that all resolved by removing power from the brick.
Assuming that’s true, there seems to be some issue at the brick level: if unplugging the USB-C cable from the Spark and plugging it back in doesn’t resolve it, but removing power from the brick does, that seems to implicate the brick and PD negotiation.
I’ve seen this behavior on both of my systems, a FE Spark, and the MSI. I created this diagnostic to both know when a Spark has entered this state, and also to try to narrow down what causes it to get there.
I wonder if the brick itself has any built-in safeguards in case of overheating that reset after it’s unplugged for a while?
If someone experiencing this issue has 2 Sparks and 2 bricks, maybe they could test by swapping power adaptors between the two Sparks after coming across the problem?
I’m having the same issue after the update on 2 Sparks. Switching power adaptors did not resolve the issue. This is crazy on a $4000 device (heck, $8000 in this case).
I have had this issue on both of mine, also a FE Spark and an MSI unit. It becomes obvious when the PP takes a dump, and running this script will show the failure. Only unplugging the brick fixes it. One thing that did trigger it for me turned out to be a bad CX7 cable from FS. That cable caused a host of network errors, and from time to time the brick would randomly enter this state on both machines. I replaced the cable with a new one and now the problem happens much less frequently.