During model training, I performed frequent DVFS adjustments by calling pynvml.nvmlDeviceSetGpuLockedClocks to tune the GPU clock frequency. In my experiments, calling nvmlDeviceSetGpuLockedClocks at a high rate (approximately every 50 ms) caused significant delays in certain NCCL communication operators. The delay does not seem to be related to the frequency value itself: if I lock the clock to a very low frequency and avoid frequent adjustments, the delay does not occur. Does anyone know why this happens?
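
To help narrow this down, here is a minimal sketch of how the cost of each clock-setting call could be timed on its own, outside of training. The helper names `time_clock_calls` and `run_on_gpu` and the clock values 1200 and 1500 MHz are placeholders for illustration, not taken from my training code. The pynvml calls normally require root privileges and a GPU that supports locked clocks.

```python
import time


def time_clock_calls(set_clocks, freqs_mhz, interval_s=0.05, iterations=20):
    """Call set_clocks(mhz, mhz) every interval_s seconds.

    Cycles through freqs_mhz and returns the duration of each call in seconds.
    """
    durations = []
    for i in range(iterations):
        mhz = freqs_mhz[i % len(freqs_mhz)]
        start = time.perf_counter()
        set_clocks(mhz, mhz)  # lock min and max clock to the same value
        durations.append(time.perf_counter() - start)
        time.sleep(interval_s)
    return durations


def run_on_gpu(index=0, freqs_mhz=(1200, 1500)):
    """Time nvmlDeviceSetGpuLockedClocks on one GPU (placeholder clock values)."""
    import pynvml  # imported here so the timing helper works without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    try:
        durations = time_clock_calls(
            lambda lo, hi: pynvml.nvmlDeviceSetGpuLockedClocks(handle, lo, hi),
            list(freqs_mhz),
        )
        mean_ms = 1e3 * sum(durations) / len(durations)
        print(f"mean {mean_ms:.2f} ms, max {1e3 * max(durations):.2f} ms")
    finally:
        # Always restore default clock behavior before exiting.
        pynvml.nvmlDeviceResetGpuLockedClocks(handle)
        pynvml.nvmlShutdown()
```

Calling `run_on_gpu(0)` once while the GPU is idle and once while an NCCL collective is running would show whether the call itself becomes slow, or whether only the communication operators are affected.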
Additionally, when eight processes on one server simultaneously set the frequencies of eight GPUs, are the NVML operations executed in parallel or sequentially?
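
One way this could be checked empirically is to have all processes issue the call at the same moment (for example after a barrier), record each call's wall-clock start and end, and compare the span of the whole batch with the latency of a single uncontended call. The helper below is a hypothetical sketch of that comparison only. It assumes the timestamps come from a clock shared by all processes, such as `time.time()`, and it does not itself call NVML.

```python
def batch_span_ratio(solo_latency_s, intervals):
    """Compare a batch of concurrent calls against one uncontended call.

    intervals: list of (start, end) wall-clock timestamps, one per process,
    for calls that were all released at roughly the same time.
    Returns batch wall time divided by the solo latency.
    """
    if solo_latency_s <= 0:
        raise ValueError("solo_latency_s must be positive")
    first_start = min(start for start, _ in intervals)
    last_end = max(end for _, end in intervals)
    return (last_end - first_start) / solo_latency_s


# Example with made-up numbers: four calls that finish one after another.
serialized = [(0.0, 0.01), (0.0, 0.02), (0.0, 0.03), (0.0, 0.04)]
print(batch_span_ratio(0.01, serialized))
```

A ratio close to 1 would suggest the calls ran in parallel, while a ratio close to the number of processes would suggest they were serialized somewhere in the driver or library.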