Training problems: mAP, precision, recall all zero

Hi all,

We recently started exploring DIGITS as a training and deployment tool and have run into issues training pretrained models on our dataset to detect people's heads. The main issue we're seeing is that the accuracy metrics (mAP, precision, and recall) stay at zero throughout all 100 epochs of training, and the losses don't appear to be converging. We were using a very small dataset (433 training images, 27 validation images) for this and plan to retry with a larger dataset.

The images in our dataset are .jpg files at a resolution of 704x480, and we resized them with the DIGITS dataset creation tool to 1248x384 to match the input size of the BVLC_GoogLeNet pretrained model used in the object-detection example here: https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md (a rough sketch of the equivalent image/label rescaling is included further below). We used the following parameters in the object-detection model creation page:

Epochs: 100
Snapshot interval: 1.0
Validation interval: 1.0
Random seed: None
Batch size: 6
Batch accumulation: 1
Blob format: NVCaffe
Solver: Adam
Base LR: 0.0001

Subtract Mean: None
Crop size: None

For the network, we used the custom network definition from the object-detection example linked above (modified to point to our dataset's databases), together with the pretrained bvlc_googlenet.caffemodel linked from the same example instructions.
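
As a sanity check on the resize, here is a rough Python sketch of the image/label rescaling we mean: scaling the 704x480 frames to 1248x384 and scaling the KITTI-format bounding boxes by the same x/y factors so the labels stay aligned with the resized images. This is only an illustration; the images/ and labels/ directory names are hypothetical placeholders, and we have not verified whether DIGITS performs the equivalent step internally.

import os
from PIL import Image

SRC_W, SRC_H = 704, 480
DST_W, DST_H = 1248, 384
SX, SY = DST_W / SRC_W, DST_H / SRC_H  # independent x/y scale factors

def rescale_label_line(line):
    # Scale the 2D bbox fields of one KITTI label line
    # (fields 4-7: left, top, right, bottom).
    f = line.split()
    if len(f) < 8:
        return line  # leave malformed/empty lines untouched
    f[4] = str(float(f[4]) * SX)  # left
    f[5] = str(float(f[5]) * SY)  # top
    f[6] = str(float(f[6]) * SX)  # right
    f[7] = str(float(f[7]) * SY)  # bottom
    return " ".join(f)

def rescale_pair(img_in, lbl_in, img_out, lbl_out):
    # Resize the image and rewrite its label file with scaled boxes.
    Image.open(img_in).resize((DST_W, DST_H), Image.BILINEAR).save(img_out)
    with open(lbl_in) as fin, open(lbl_out, "w") as fout:
        for line in fin:
            fout.write(rescale_label_line(line.rstrip("\n")) + "\n")

if __name__ == "__main__":
    # hypothetical layout: images/*.jpg with matching labels/*.txt
    os.makedirs("images_1248x384", exist_ok=True)
    os.makedirs("labels_1248x384", exist_ok=True)
    for name in os.listdir("images"):
        stem, _ = os.path.splitext(name)
        rescale_pair(os.path.join("images", name),
                     os.path.join("labels", stem + ".txt"),
                     os.path.join("images_1248x384", name),
                     os.path.join("labels_1248x384", stem + ".txt"))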

Our generated network definition files and the log files from our model and dataset DB creation are accessible here: https://www.dropbox.com/sh/nmatjm0h3952mv2/AAB2Ow8x0NYk0FORuOcaivZia?dl=0.
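
To show what we mean by the losses not converging, this is roughly how we pull the loss values out of the attached caffe_output.log. It is a quick sketch that assumes the standard Caffe "Iteration N ... loss = X" log lines; adjust the regex if your log differs.

import re
import sys

LOSS_RE = re.compile(r"Iteration (\d+).*?loss = ([0-9.eE+-]+)")

def read_losses(path):
    # Collect (iteration, loss) pairs from a Caffe training log.
    points = []
    with open(path, errors="replace") as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                points.append((int(m.group(1)), float(m.group(2))))
    return points

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "faceblur_sandiego_detectnet1248x384_caffe_output.log"
    for it, loss in read_losses(log):
        print(f"{it}\t{loss}")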

Interestingly, we also had issues training the BVLC_GoogLeNet pretrained model on the KITTI dataset per the instructions here: https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md. Across several training runs we kept seeing the model collapse at epoch 8: the loss would skyrocket and the accuracy metrics would fall to zero before recovering, but they stayed below their previous maxima. This issue apparently was resolved by adding the interactive-mode flags (-it) to the nvidia-docker run command. We're not sure whether this is what actually fixed the problem, but perhaps there are some subtleties of interactive mode that DIGITS requires.

Environment info:
Running the NVIDIA DIGITS 18.05 Docker container with the following command line:

nvidia-docker run -it --rm --name digits -d -p 8888:5000 -v /home/sean/digits-workspace/data:/data -v /mnt/data/passenger_capacity_data:/passenger_capacity_data -v /home/sean/digits-workspace/jobs:/workspace/jobs nvcr.io/nvidia/digits:18.05

DIGITS version: 6.1.1
Caffe version: 0.17.0

Thanks,

Sean
faceblur_sandiego_detectnet1248x384_create_val_db_db.log (786 Bytes)
faceblur_sandiego_detectnet1248x384_create_train_db_db.log (1.34 KB)

faceblur_sandiego_detectnet1248x384_caffe_output.log (890 KB)

Do you have any updates regarding this issue? My training loss also doesn't seem to be converging. I am using the out-of-the-box AlexNet architecture in DIGITS 20.02. Thanks!