I followed the instructions from:
Here is my current hardware and software setup:
Docker version 28.2.2
ii nvidia-container-toolkit 1.17.8-1 amd64 NVIDIA Container toolkit
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000000:C9:00.0 Off |                    0 |
| N/A   38C    P0             61W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 NVL                On  |   00000000:E1:00.0 Off |                    0 |
| N/A   38C    P0             62W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Everything meets the requirements in the guide. However, the model raises an OOM error even with a relatively short sequence (~150k).
I can see that the model is distributed across both GPUs, and that it uses both GPUs at inference time.
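To confirm where the memory is actually going before the OOM, a small diagnostic can be run inside the container. This is a sketch assuming a PyTorch backend (which I have not confirmed the guide uses); it prints what the current process holds on each visible GPU and degrades gracefully when CUDA is unavailable:

```python
# Diagnostic sketch (ASSUMES a PyTorch backend): report per-GPU memory held
# by this process, to see how the model is split before/after loading.

def gpu_memory_report():
    """Return (device, allocated_GiB, reserved_GiB) per visible GPU."""
    try:
        import torch
    except ImportError:
        return []  # torch not installed in this environment
    if not torch.cuda.is_available():
        return []  # no CUDA device visible
    report = []
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30   # live tensors
        reserv = torch.cuda.memory_reserved(i) / 2**30   # caching-allocator pool
        report.append((i, round(alloc, 2), round(reserv, 2)))
    return report

if __name__ == "__main__":
    for dev, alloc, reserv in gpu_memory_report():
        print(f"GPU {dev}: allocated {alloc} GiB, reserved {reserv} GiB")
```

Calling this right after model load and again just before generation shows whether the weights or the runtime buffers are what fills the 95 GiB per card.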
I'm wondering if you are aware of this, whether there is a solution, or whether I'm missing something. Any response would be greatly appreciated.
Note that the model does work with shorter sequences.
For instance, after the environment issue from the GitHub repo was solved, I'm hitting the same problem with some small differences:
from Docker: → the model takes ~40 GB on each GPU (as expected from the guide)
from the GitHub repo: → the model takes ~45 GB per GPU (I assume due to some of your environment optimizations)
HOWEVER, the model from the GitHub repo can process longer sequences (~150k, but no more). Is there any explanation for this, especially considering the GitHub version uses even more memory before inference?
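One possible explanation (a guess, not confirmed for this model) is that the weights fit but the attention KV cache does not: it grows linearly with sequence length, so a setup that loads fine can still OOM on long inputs. A back-of-envelope sketch, with HYPOTHETICAL placeholder dimensions that should be replaced by the real values from the model's config:

```python
# Back-of-envelope KV-cache estimate. All model dimensions below are
# HYPOTHETICAL placeholders, not the actual model's values.

def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Estimate KV-cache size in GiB for one sequence (K and V, all layers)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / 2**30

for n in (10_000, 50_000, 150_000):
    print(f"{n:>7} tokens -> ~{kv_cache_gib(n):.1f} GiB of KV cache")
# With these placeholder dims: ~1.2 GiB at 10k, ~6.1 GiB at 50k, ~18.3 GiB at 150k
```

If the two environments differ in how they allocate or pre-reserve this cache (e.g. paged vs. contiguous allocation), that could explain why the build with the larger resident footprint still handles longer sequences.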