What’s the time resolution of your scope measurements?
The nominal power rating of a CPU or GPU is based on the average power consumption across tens of seconds, and is mostly important for managing thermal issues, i.e. sizing of cooling solutions, which is why it is sometimes called TDP (thermal design power). The same is true for the nominal power rating of PCIe auxiliary power cables, which nominally supply up to 150W per 8-pin connector and up to 75W per 6-pin connector.
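To illustrate how these nominal ratings add up, here is a minimal sketch of a GPU's sustained power budget from the slot plus auxiliary connectors. The function name and the example connector configuration are my own; the per-connector figures are the ones quoted above.

```python
# Sketch of a GPU's nominal (sustained) power budget per the PCIe ratings above.
PCIE_SLOT_W = 75   # power deliverable through the PCIe slot itself
AUX_8PIN_W = 150   # per 8-pin auxiliary connector
AUX_6PIN_W = 75    # per 6-pin auxiliary connector

def nominal_gpu_budget_w(n_8pin: int, n_6pin: int) -> int:
    """Nominal sustained budget; microsecond-scale spikes can exceed this."""
    return PCIE_SLOT_W + n_8pin * AUX_8PIN_W + n_6pin * AUX_6PIN_W

# Hypothetical example: a card with two 8-pin auxiliary connectors
print(nominal_gpu_budget_w(2, 0))  # 375
```

This is exactly why millisecond-scale spikes near 600W stand out: they exceed what the connectors are nominally rated to deliver in sustained operation.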
Rapid changes in workload intensity in conjunction with dynamic clocking employed by modern high-performance processors (CPUs as well as GPUs) can lead to significant power spikes on the order of microseconds to tens of microseconds. These power spikes can be more pronounced in compute apps than in graphics apps, as these different application classes exercise the functional units of the GPU differently, and are commonly observed with machine learning apps.
So if your oscilloscope can measure with, say, millisecond resolution, it would be normal to observe such power spikes, though I am a bit surprised that they reach as high as 600W. While CPU and GPU power spikes are usually not in sync, quasi-simultaneous power spikes can occur and can contribute to a power supply being overwhelmed. When this happens, it most frequently manifests as random reboots a few minutes into running a machine-learning application. This is caused by the power spike leading to a voltage drop ("brown-out"). In more severe cases, the power supply itself may shut down.
A properly sized PSU (power supply unit) is therefore important for HPC systems including those running AI tasks. My standing recommendation for rock-solid operation across a projected system life span of five years is to size the PSU such that the sum of the nominal power consumption of all system components does not significantly exceed 60% of the nominal power rating of the PSU. Assume 0.4W per GB of DDR4 system memory when summing the nominal power consumption of the system components.
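The sizing rule above can be sketched as a small calculation. The function name and the example component wattages are hypothetical placeholders; substitute the nominal ratings of your own components.

```python
# Sketch of the 60% PSU sizing rule described above.
def recommended_psu_w(component_watts, ddr4_gb=0):
    """Minimum PSU rating so the summed nominal load stays at ~60% of it."""
    total = sum(component_watts) + 0.4 * ddr4_gb  # 0.4W per GB of DDR4
    return total / 0.6

# Hypothetical example: 350W GPU + 125W CPU + 50W board/drives/fans + 128 GB DDR4
print(round(recommended_psu_w([350, 125, 50], ddr4_gb=128)))  # 960
```

In other words, a system with about 575W of summed nominal consumption would call for a PSU rated around 1000W for rock-solid operation over a five-year life span.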
In addition, I recommend paying attention to the 80PLUS rating of the PSU: use an 80PLUS Gold compliant PSU as the minimum for a high-performance workstation, with 80PLUS Platinum preferred; for a high-performance server, use 80PLUS Platinum as the minimum, with 80PLUS Titanium preferred. PSUs with high 80PLUS ratings are more efficient, tend to run cooler (which helps extend the lifetime of electronic components), usually are designed with higher engineering margins and better quality components, and often come with longer vendor warranties. The recommendation of a higher 80PLUS level for servers is based on the difference in duty cycle compared to a workstation.
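To see why efficiency matters for cooling, here is a minimal sketch of the wall draw and waste heat implied by a given efficiency. The function names are my own, and the efficiency figures are approximations (80PLUS Gold requires roughly 90% at 50% load at 115V, Titanium roughly 94%).

```python
# Sketch: wall draw and waste heat for a DC load at a given PSU efficiency.
def wall_draw_w(dc_load_w: float, efficiency: float) -> float:
    return dc_load_w / efficiency

def waste_heat_w(dc_load_w: float, efficiency: float) -> float:
    """Power dissipated inside the PSU as heat."""
    return wall_draw_w(dc_load_w, efficiency) - dc_load_w

# Example: 500W DC load at ~Gold (0.90) vs ~Titanium (0.94) efficiency
print(round(waste_heat_w(500, 0.90)))  # 56
print(round(waste_heat_w(500, 0.94)))  # 32
```

The higher-rated unit dumps noticeably less heat into its own components at the same load, which is part of why it tends to run cooler and last longer.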
Per this Reddit thread, the 600W spikes you observed with the RTX 3090 are roughly in line with what others have observed:
I had a chat to Seasonic, they let me know that in their labs they have seen RTX 3090 transient loads spike to north of 550W before the power limits kick in and pull them back down.
There are some comments in that thread that NVIDIA and CPU manufacturers "need to get power spikes under control". CPUs and GPUs already have active power management, but any such mechanism has finite response time. Best I know, current systems respond within 100 milliseconds, possibly less than that. While hardware vendors may be able to reduce the response time further (I do not have the expertise to guesstimate what a reasonable lower bound could be), it will never be zero, and therefore power spikes will continue to exist.
[Even later:] This review of the RTX 3090 FE includes oscilloscope pictures in which some narrow 1-millisecond power spikes of up to 570W are visible. They conclude:
For this card, I would therefore calculate at least 460 to 500 watts as a proportion of the total secondary power consumption of the system.
Their graph showing the highest power draw observed over various durations seems to suggest that the RTX 3090 power management is capable of reducing power draw to the 350W nominal limit within about 25 milliseconds.