I am seeing strange results when scaling from 1 or 2 GPUs to 8 GPUs while training a FasterRCNN model in TLT 3.0.
When I train the model with just 1 or 2 GPUs, I get the expected validation scores (e.g. mAP). However, when I train a model on the same dataset and with the same spec file, but with 8 GPUs, all of the validation metrics come back as 0.0.
My immediate thought was that something is going wrong during the training process; however, the loss values reported during each epoch look normal. It is only the validation scores that are broken.
Additionally, the 8-GPU-trained model produces no detections when running tlt faster_rcnn inference on a dataset that the 1-GPU- and 2-GPU-trained models handle fine. That is, the inference command runs to completion without errors; it just doesn't output any predictions.
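For reference, the commands I am running look roughly like this (the spec file path and the encryption key are placeholders; the spec file is identical between the 1/2-GPU and 8-GPU runs except for the GPU count):

```
# 8-GPU training run that produces 0.0 validation metrics
tlt faster_rcnn train -e /workspace/specs/frcnn_spec.txt -k $KEY --gpus 8

# inference with the resulting model (runs cleanly but outputs no predictions)
tlt faster_rcnn inference -e /workspace/specs/frcnn_spec.txt -k $KEY
```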
Could you help me figure out why the model fails when trained with 8 GPUs?
As one last note, can you clarify whether any of the loss metrics printed during training are validation losses rather than training losses? We are having trouble finding any documentation distinguishing loss vs. rpn_out_class_loss vs. rpn_out_regress_loss, etc. Are they all computed on the training data? Can you point me to documentation on those metrics? Obviously the mAP scores I refer to above are a validation metric, but the documentation doesn't explain whether other validation metrics can be reported instead, such as a validation loss.