(EngineCore_0 pid=194) ERROR 10-21 01:05:31 [core.py:700] raise ValueError(
(EngineCore_0 pid=194) ERROR 10-21 01:05:31 [core.py:700] ValueError: Free memory on device (49.37/119.7 GiB) on startup is less than desired GPU memory utilization (0.5, 59.85 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
At Codex's suggestion I tried -e GPU_MEMORY_UTILIZATION=0.3, but it didn’t help. The DGX Dashboard (which shows 100 GB free) and nvidia-smi (which shows 59 GB used) report different memory usage numbers (see screenshot). Why is that?
Appreciate any other ideas on how to get this working. Thanks!
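For reference, the check behind the error above can be reproduced with simple arithmetic. The sketch below uses only the numbers printed in the log (119.7 GiB total, 49.37 GiB free at startup); the function names are hypothetical and this is not vLLM's actual code. It shows that a utilization of 0.5 asks for more than is free, while anything up to roughly 0.41 would fit if the setting actually reaches the engine.

```python
# Numbers copied from the error message in the log above.
TOTAL_GIB = 119.7  # total device memory reported by the engine
FREE_GIB = 49.37   # free memory on startup reported by the engine


def requested_gib(utilization, total_gib=TOTAL_GIB):
    """Memory the engine would try to reserve for a given utilization fraction."""
    return utilization * total_gib


def max_utilization(free_gib=FREE_GIB, total_gib=TOTAL_GIB):
    """Largest utilization fraction that still fits in the free memory."""
    return free_gib / total_gib


def fits(utilization, free_gib=FREE_GIB, total_gib=TOTAL_GIB):
    """True if the requested reservation is no larger than the free memory."""
    return requested_gib(utilization, total_gib) <= free_gib
```

With these numbers, requested_gib(0.5) is 59.85 GiB, matching the log, and requested_gib(0.3) is about 35.9 GiB, which is below the 49.37 GiB reported free. If the error still quotes 0.5 after the variable is set to 0.3, it may be worth checking that the value is actually being passed through to the engine; that is a guess, not something confirmed in this thread.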
The number on the DGX Dashboard shows the amount of memory being used, not how much is free. The figure reported by nvidia-smi is not always accurate because of the unified memory architecture. Please refer to our FAQ for more information.
This is a known issue that will be fixed in the next update. For now, you can also check the MemAvailable field in /proc/meminfo for the most accurate number.
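As a minimal sketch of the /proc/meminfo check suggested above, the snippet below pulls out the MemAvailable value and converts it to GiB. The function names are hypothetical; the only assumption about the file is the standard Linux layout, where the line looks like "MemAvailable:  12345678 kB" and the unit is kibibytes.

```python
def mem_available_gib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Second field is the size; the kernel reports it in kB (KiB).
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


def read_mem_available_gib(path="/proc/meminfo"):
    """Read the file at `path` (Linux only) and return MemAvailable in GiB."""
    with open(path) as f:
        return mem_available_gib(f.read())
```

On the machine itself, calling read_mem_available_gib() just before launching the container gives a number to compare against the "Free memory on device" value in the error message.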