Training stops with "Stalled ranks" message

Hi, I’m using “train_multigpu_finetune.sh” to fine-tune the ngc covid19 model on my data (binary classification). Training goes well until, at a certain point, it starts printing the following “Stalled ranks:” message repeatedly:


The GPUs are idle and training remains stalled even after waiting for about an hour. Any idea what might be the issue? This is my train_multigpu_finetune.sh script:

thanks!

Hi
This seems like a data problem. Could you share your data list (dataset_vsall.json) file? Also, what model are you training, and what is the problem type: 2D/3D, classification/segmentation?
Thanks

Hi, I’m working with “clara_train_covid19_3d_ct_classification”. The dataset.json is attached; please change the extension back from .log to .json (I can’t upload .json on the forum). dataset_0vsall.log (123.6 KB)

If this were about the dataset, wouldn’t we expect Clara Train to fail before even finishing the first epoch? In my first screenshot, you can see the error appears at epoch 62.

Hi
Well, this actually confirms my guess that it is a data problem. Clara Train doesn’t check all the data files before starting. If it trains for 60+ epochs and then errors out, there is likely something wrong with a file it first sees in that epoch.
Could you shrink your data set to 10 files or fewer and see if you can train?

Another cause (not in your case, but in similar situations) would be something different in the validation set, so the error always occurs at the validation step.

Hope that helps
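Before shrinking the data set, a quick sanity check over both the training and validation lists can rule out missing or empty files. Below is a minimal sketch; it assumes a Decathlon-style dataset.json where "training" and "validation" hold {"image": ..., "label": ...} entries with the image path relative to a data root (the key names and layout are assumptions based on the typical Clara Train format, so adjust them to match your file):

```python
import json
import os

def find_bad_files(json_path, data_root="."):
    """Return image paths from the training/validation lists that are
    missing or empty on disk."""
    with open(json_path) as f:
        spec = json.load(f)
    bad = []
    for section in ("training", "validation"):
        for entry in spec.get(section, []):
            # For classification, "label" is a class index, so only
            # the "image" path points at a file worth checking.
            rel = entry["image"] if isinstance(entry, dict) else entry
            full = os.path.join(data_root, rel)
            if not os.path.isfile(full) or os.path.getsize(full) == 0:
                bad.append(rel)
    return bad
```

This only catches files that are absent or zero-length; a file that exists but is corrupted would still need to be opened with your image reader to be detected.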

Hi! So after running some experiments, it seems like it’s not data-related. I ran the same train_finetune.sh script on the same dataset.json and it works. Unfortunately, the opposite is also true: dataset.json files that worked before can suddenly produce a continuous “stalled ranks” warning, stalling the training procedure. I’ll try to isolate the problem and share all the configs and warnings.
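One way to isolate a file-dependent stall is to bisect the training list: split dataset.json into two halves, train on each, and keep bisecting whichever half stalls. A minimal sketch of the splitting step (the "training" key and output file names are assumptions, not part of any Clara Train tool):

```python
import json

def split_dataset(json_path, out_prefix="dataset_half"):
    """Write two smaller dataset files, each holding half of the
    training list, so each half can be fine-tuned separately.
    Repeating this bisection narrows a stall down to a single file."""
    with open(json_path) as f:
        spec = json.load(f)
    items = spec["training"]
    mid = len(items) // 2
    outs = []
    for i, chunk in enumerate((items[:mid], items[mid:])):
        half = dict(spec)          # keep all other keys unchanged
        half["training"] = chunk
        out = f"{out_prefix}_{i}.json"
        with open(out, "w") as f:
            json.dump(half, f, indent=2)
        outs.append(out)
    return outs
```

Note that if the stall is nondeterministic (as your experiments suggest), each half may need several runs before you can trust a negative result.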