That's actually a really good idea - treating the same node as two nodes through a distributed job is a fine approach. I just tried it, and it fails with the same error, but it did give another data point:
- Once the first process is using GPUs 0,1,2,3, CUDA errors out when the second process tries to initialize GPUs 4,5,6,7. So something cross-process is blocking initialization of more than 4 GPUs at a time.
For example, run two different processes:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; import time; torch.cuda.is_available(); time.sleep(100)'
CUDA_VISIBLE_DEVICES=4,5,6,7 python -c 'import torch; import time; torch.cuda.is_available(); time.sleep(100)'
```
If you specify the same GPUs in both processes, both succeed. However, if the two processes together span more than 4 distinct GPUs, whichever process is launched second will fail.
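
In case it helps narrow this down, here's a minimal sketch (assuming PyTorch, as in the one-liners above) to run as the second process while the first one is holding GPUs 0,1,2,3. It initializes each visible device one at a time instead of all at once, which should show whether the second process fails on its very first device or only once the combined count across both processes exceeds 4:

```python
import torch

# Run with e.g. CUDA_VISIBLE_DEVICES=4,5,6,7 while another process
# already holds GPUs 0,1,2,3. Devices are renumbered 0..3 within
# this process's view.
for i in range(torch.cuda.device_count()):
    try:
        # A tiny allocation forces CUDA context creation on this device,
        # rather than relying on torch.cuda.is_available()'s lazy init.
        torch.zeros(1, device=f"cuda:{i}")
        print(f"cuda:{i} initialized OK")
    except RuntimeError as e:
        print(f"cuda:{i} failed: {e}")
```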