Unfortunately, switching to the slower ‘safetensors’ loader made no difference, but I did see something weird. The 2nd node (GX10) spiked to 121 GB of memory usage after only 5% of the model had loaded. Loading stalled at that point because the 2nd node became unresponsive. It eventually resumed and ran to 100%, at which point the same 121 GB memory spike happened again and locked up both nodes. You can see the strange delay right after the 5% mark here:
Loading safetensors checkpoint shards: 0% Completed | 0/40 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/40 [00:11<07:10, 11.05s/it]
Loading safetensors checkpoint shards: 5% Completed | 2/40 [00:22<07:14, 11.44s/it]
Loading safetensors checkpoint shards: 8% Completed | 3/40 [07:22<2:01:56, 197.75s/it]
Loading safetensors checkpoint shards: 10% Completed | 4/40 [07:34<1:14:45, 124.59s/it]
Loading safetensors checkpoint shards: 12% Completed | 5/40 [07:45<48:42, 83.50s/it]
Loading safetensors checkpoint shards: 15% Completed | 6/40 [07:55<33:10, 58.54s/it]
Loading safetensors checkpoint shards: 18% Completed | 7/40 [08:05<23:26, 42.62s/it]
Loading safetensors checkpoint shards: 20% Completed | 8/40 [08:18<17:43, 33.23s/it]
Loading safetensors checkpoint shards: 22% Completed | 9/40 [08:29<13:33, 26.25s/it]
Loading safetensors checkpoint shards: 25% Completed | 10/40 [08:40<10:45, 21.52s/it]
Loading safetensors checkpoint shards: 28% Completed | 11/40 [08:51<08:51, 18.33s/it]
Loading safetensors checkpoint shards: 30% Completed | 12/40 [09:03<07:40, 16.43s/it]
Loading safetensors checkpoint shards: 32% Completed | 13/40 [09:13<06:28, 14.37s/it]
Loading safetensors checkpoint shards: 35% Completed | 14/40 [09:22<05:36, 12.96s/it]
Loading safetensors checkpoint shards: 38% Completed | 15/40 [09:32<04:58, 11.94s/it]
Loading safetensors checkpoint shards: 40% Completed | 16/40 [09:44<04:48, 12.00s/it]
Loading safetensors checkpoint shards: 42% Completed | 17/40 [09:56<04:35, 11.97s/it]
Loading safetensors checkpoint shards: 45% Completed | 18/40 [10:08<04:23, 11.97s/it]
Loading safetensors checkpoint shards: 48% Completed | 19/40 [10:19<04:04, 11.65s/it]
Loading safetensors checkpoint shards: 50% Completed | 20/40 [10:30<03:50, 11.51s/it]
Loading safetensors checkpoint shards: 52% Completed | 21/40 [10:41<03:36, 11.40s/it]
Loading safetensors checkpoint shards: 55% Completed | 22/40 [10:53<03:27, 11.52s/it]
Loading safetensors checkpoint shards: 57% Completed | 23/40 [11:05<03:17, 11.62s/it]
Loading safetensors checkpoint shards: 60% Completed | 24/40 [11:16<03:03, 11.47s/it]
Loading safetensors checkpoint shards: 62% Completed | 25/40 [11:28<02:54, 11.64s/it]
Loading safetensors checkpoint shards: 65% Completed | 26/40 [11:40<02:44, 11.76s/it]
Loading safetensors checkpoint shards: 68% Completed | 27/40 [11:52<02:33, 11.77s/it]
Loading safetensors checkpoint shards: 70% Completed | 28/40 [12:03<02:18, 11.54s/it]
Loading safetensors checkpoint shards: 72% Completed | 29/40 [12:14<02:05, 11.40s/it]
Loading safetensors checkpoint shards: 75% Completed | 30/40 [12:25<01:55, 11.51s/it]
Loading safetensors checkpoint shards: 78% Completed | 31/40 [12:37<01:42, 11.43s/it]
Loading safetensors checkpoint shards: 80% Completed | 32/40 [12:49<01:32, 11.56s/it]
Loading safetensors checkpoint shards: 82% Completed | 33/40 [13:00<01:20, 11.53s/it]
Loading safetensors checkpoint shards: 85% Completed | 34/40 [13:11<01:07, 11.29s/it]
Loading safetensors checkpoint shards: 88% Completed | 35/40 [13:22<00:55, 11.17s/it]
Loading safetensors checkpoint shards: 90% Completed | 36/40 [13:33<00:45, 11.27s/it]
Loading safetensors checkpoint shards: 92% Completed | 37/40 [13:45<00:34, 11.43s/it]
Loading safetensors checkpoint shards: 95% Completed | 38/40 [13:56<00:22, 11.38s/it]
Loading safetensors checkpoint shards: 98% Completed | 39/40 [14:04<00:10, 10.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [14:16<00:00, 10.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [14:16<00:00, 21.41s/it]
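To put a number on the stall in the progress output above, here is a minimal Python sketch that parses the elapsed-time field of each progress line and reports the time spent on each shard. The function names (`to_seconds`, `shard_gaps`) are my own, and the sample is just the first five lines of the log, so treat it as an illustration and not a polished tool.

```python
import re

# Matches "N/40 [MM:SS<" (or "[H:MM:SS<") in a progress line.
PROGRESS = re.compile(r"(\d+)/(\d+) \[(\d+(?::\d+)+)<")

def to_seconds(stamp):
    """Convert "MM:SS" or "H:MM:SS" to a number of seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total

def shard_gaps(log_text):
    """Return (shard_number, seconds_since_previous_shard) pairs."""
    gaps = []
    prev = None
    for line in log_text.splitlines():
        m = PROGRESS.search(line)
        if not m:
            continue
        shard, elapsed = int(m.group(1)), to_seconds(m.group(3))
        # Skip repeated lines for the same shard (e.g. the final summary line).
        if prev is not None and shard > prev[0]:
            gaps.append((shard, elapsed - prev[1]))
        prev = (shard, elapsed)
    return gaps

# First five lines of the log above, used as sample input.
sample = """\
Loading safetensors checkpoint shards: 0% Completed | 0/40 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/40 [00:11<07:10, 11.05s/it]
Loading safetensors checkpoint shards: 5% Completed | 2/40 [00:22<07:14, 11.44s/it]
Loading safetensors checkpoint shards: 8% Completed | 3/40 [07:22<2:01:56, 197.75s/it]
Loading safetensors checkpoint shards: 10% Completed | 4/40 [07:34<1:14:45, 124.59s/it]
"""

gaps = shard_gaps(sample)
worst = max(gaps, key=lambda g: g[1])
print(worst)  # (3, 420)
```

On this sample the third shard accounts for 420 seconds (7 minutes), while its neighbours take 11 to 12 seconds each.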
And yes, I drop the cache before each run. Here’s the journalctl output from the last boot, right before the hard reset:
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Something definitely seems off here.
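For anyone who wants to compare their own logs, this is a small Python sketch that counts the NV_ERR_NO_MEMORY lines per second in a journalctl dump. It assumes the default short journalctl format shown above, where the first three fields are the month, day, and time. The function name `oom_per_second` is made up for this example, and the sample is a shortened subset of the log, not the full excerpt.

```python
from collections import Counter

def oom_per_second(journal_text):
    """Count NVRM out-of-memory lines grouped by their timestamp."""
    counts = Counter()
    for line in journal_text.splitlines():
        if "NV_ERR_NO_MEMORY" not in line:
            continue
        # Default journalctl output starts with "Mon DD HH:MM:SS".
        stamp = " ".join(line.split()[:3])
        counts[stamp] += 1
    return counts

# Shortened subset of the journalctl excerpt above, used as sample input.
sample = """\
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
"""

result = oom_per_second(sample)
for stamp, n in sorted(result.items()):
    print(stamp, n)
```

Run against the full excerpt above, this would show 3 failures at 06:45:04 and 9 at 06:45:05, which makes the burst just before the hard reset easier to see.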