DIGITS Custom Model Retraining

Hello, I’m trying to modify DIGITS to start from a pretrained Keras model. Below is a full summary of the steps I’ve taken so far, in the hope that someone can help identify possible paths to a solution:

The idea is to add a “Standard Networks” script, alongside the provided AlexNet, LeNet, and GoogLeNet defaults, that instead loads a custom model from a Keras weights file.

Outside of DIGITS, I loaded a pretrained Keras model (a car classifier) from an hdf5 file, added a Dense layer, and retrained on a car/truck classification dataset (resized to 224×224, the network’s input size) with Adam optimization (base LR = 0.0001). This reached 98% train and 90% val accuracy.
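For reference, the out-of-DIGITS retraining looked roughly like the sketch below; the hdf5 path, the layer index used to strip the old head, and the data loading are placeholders for my actual setup:

```python
import tensorflow as tf

# 'car_classifier.hdf5' is a placeholder for my pretrained model file.
base = tf.keras.models.load_model('car_classifier.hdf5')

# Drop the original classification head (assumed here to be the last
# layer) and attach a new Dense layer for the two-class car/truck task.
features = base.layers[-2].output
outputs = tf.keras.layers.Dense(2, activation='softmax')(features)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# x_train/y_train, x_val/y_val are the 224x224 car/truck images and
# labels (loading not shown).
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```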

I then pulled the latest DIGITS image from NGC, ran the container, and created a DIGITS dataset from the same car/truck classification images (resized to 224×224 via the provided “squash” method). Next, I created a new classification model and specified the optimizer (Adam), base learning rate (0.0001), and dataset (no data augmentation); essentially, I was trying to recreate, within DIGITS, the retraining experiment that had succeeded outside of it.

Using the other “Standard Network” scripts as reference and working in the “Customize Network” tab, I used tf.keras inside the “UserModel” class to load my model architecture and weights from the hdf5 file, replaced the input with “self.x”, and added a Dense layer with “self.nclasses” units, so the output of “inference()” is the output tensor of the modified model. In “loss()”, I used the classification_loss() and classification_accuracy() helpers provided in ‘/digits/tools/tensorflow/utils.py’. A sketch of what this looks like follows.
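Here is roughly what the resulting UserModel looks like, following the structure of the provided standard-network scripts; the hdf5 path and the way I strip the original head are simplified placeholders:

```python
from model import Tower
from utils import model_property
import tensorflow as tf
import utils as digits


class UserModel(Tower):

    @model_property
    def inference(self):
        # Load the pretrained architecture + weights; the path is a
        # placeholder for my actual hdf5 file.
        base = tf.keras.models.load_model('car_classifier.hdf5')
        # Re-wire the Keras graph onto DIGITS' input tensor, dropping
        # the original classification head.
        backbone = tf.keras.Model(base.input, base.layers[-2].output)
        features = backbone(self.x)
        # New head sized for this dataset's classes; no activation,
        # since the loss helper expects logits.
        model = tf.keras.layers.Dense(self.nclasses)(features)
        return model

    @model_property
    def loss(self):
        model = self.inference
        loss = digits.classification_loss(model, self.y)
        accuracy = digits.classification_accuracy(model, self.y)
        self.summaries.append(tf.summary.scalar(accuracy.op.name, accuracy))
        return loss
```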

I was then able to kick off the retraining successfully, but found that the train and val accuracies fluctuated around 50%; the network wasn’t learning. As an experiment, I trained the same model on the same dataset outside of DIGITS, but this time from random initialization rather than from the pretrained car-classifier weights. I saw the same accuracy fluctuation around 50% as in DIGITS, so I theorized that the pretrained weights were being overwritten/re-initialized somewhere.

I looked at ‘/digits/tools/tensorflow/main.py’ and saw that ‘tf.global_variables_initializer()’ is run, likely overwriting my model weights. I then changed my “UserModel” to load only the model architecture, and edited ‘/digits/tools/tensorflow/main.py’ to load the weights and assign them to the matching existing variables (by name) after ‘tf.global_variables_initializer()’ runs. Again, train and val accuracies fluctuated around 50%.
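The patch was along these lines. This is a simplified sketch: the exact-name match is the optimistic case, since graph variables in DIGITS may carry extra scope prefixes that need to be stripped before matching.

```python
import h5py
import tensorflow as tf


def restore_keras_weights(sess, weights_path):
    """Copy weights from a Keras hdf5 file into matching graph variables.

    Called right after sess.run(tf.global_variables_initializer()) so
    the assignments are not clobbered by re-initialization.
    """
    graph_vars = {v.name: v for v in tf.global_variables()}
    with h5py.File(weights_path, 'r') as f:
        # Full model files nest weights under 'model_weights'.
        group = f['model_weights'] if 'model_weights' in f else f
        for layer_name in group.attrs['layer_names']:
            layer = group[layer_name.decode('utf8')]
            for weight_name in layer.attrs['weight_names']:
                name = weight_name.decode('utf8')  # e.g. 'conv2d/kernel:0'
                if name in graph_vars:
                    sess.run(graph_vars[name].assign(layer[name][()]))
```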

I logged one of the weights in the first Conv layer at each training step. At the start of retraining it matches the pretrained weight I’m loading in, and the value changes from step to step. So updates are being applied; the network just isn’t learning.
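The spot-check was essentially this, inside the training loop in main.py (which already has ‘sess’ and the step counter in scope); ‘conv2d/kernel:0’ is a stand-in for whatever my first conv layer’s kernel is actually named:

```python
# One-time lookup of the first conv layer's kernel variable.
first_kernel = [v for v in tf.global_variables()
                if v.name == 'conv2d/kernel:0'][0]

# Inside the training loop: read one scalar from the kernel and log it.
value = sess.run(first_kernel)[0, 0, 0, 0]
print('step %d: conv1 kernel[0,0,0,0] = %f' % (step, value))
```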

Finally, rather than starting from the car-classifier pretrained weights, I took the weights from the model I had successfully retrained on the car/truck dataset outside of DIGITS and loaded those in ‘/digits/tools/tensorflow/main.py’. Since that model achieved 98% train and 90% val accuracy outside of DIGITS, I expected retraining in DIGITS to start from around those metrics. Unfortunately, once again, when I kicked off retraining in DIGITS, train and val accuracies started and stayed around 50%.

It seems I’m still missing something; I’m hoping someone has ideas on how to diagnose/resolve this issue. Please let me know, thanks!