Clock and Power Inconsistency in COMPUTE Workload

When running a compute workload I see several clock and power management issues:

  • GPU won't reach max graphics clock despite (a) ample thermal (<30 °C) and power headroom (verified via nvidia-smi -q) and (b) high core utilization (50-99%). The clock stays 300-600 MHz below max clock.
  • I can *incrementally* trigger max clock if I enter a core clock offset in the nvidia-settings GUI and *repeatedly* press Enter while the workload is running; otherwise it stays 600 MHz below max clock. So clearly the GPU can handle max clock, it just won't get there on its own.
  • Performance level is `3` in nvidia-settings and P0 in nvidia-smi ... so that isn't the problem
  • When the workload finishes, both core and memory clocks won't drop back to idle unless I enter (any) value into the nvidia-settings core/memory clock offset fields.
  • I have verified the clock and power values via nvidia-settings, nvidia-smi and nvidia-smi stats and they agree.
  • The fact that I can increment the core clock with every "ENTER" in the offset field while the job is running seems telling, as is the inability to downclock and lower the performance level until I again enter some value into the offset field. In both cases, entering a value is what triggers the update.
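
To make the behavior above easier to reproduce and share, here is a minimal sketch of a logger that polls nvidia-smi once per second and prints how far the SM clock sits below its maximum. It is not part of my original testing, just an illustration. The function names (`parse_sample`, `clock_deficit_mhz`, `poll`) are made up for this example, it assumes GPU index 0, and it assumes the driver reports every queried field as a number (some boards return "[Not Supported]" for power.draw, which this sketch leaves as a string).

```python
import subprocess
import time

# Field names accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = [
    "timestamp", "pstate", "clocks.sm", "clocks.max.sm",
    "clocks.mem", "utilization.gpu", "temperature.gpu", "power.draw",
]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def clock_deficit_mhz(sample):
    """How many MHz the current SM clock is below the reported maximum."""
    return int(sample["clocks.max.sm"]) - int(sample["clocks.sm"])


def poll(interval_s=1.0, count=10, gpu_index=0):
    """Print `count` samples, one every `interval_s` seconds (assumes GPU 0 by default)."""
    cmd = [
        "nvidia-smi",
        "--id=%d" % gpu_index,
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        sample = parse_sample(out.splitlines()[0])
        print(
            sample["timestamp"], sample["pstate"],
            "sm=%s MHz" % sample["clocks.sm"],
            "deficit=%d MHz" % clock_deficit_mhz(sample),
            "util=%s%%" % sample["utilization.gpu"],
            "temp=%s C" % sample["temperature.gpu"],
            "power=%s W" % sample["power.draw"],
        )
        time.sleep(interval_s)
```

Calling `poll()` while the workload is running, and again after it finishes, should show whether the deficit only changes at the moments an offset value is entered in nvidia-settings.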

NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0

Please put load on the gpu and run as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.