Training a classification model from scratch


I have ~200K training images in ~10 classes for a classification model.

I need to work at a different (larger) resolution than the 224x224 used by the model with pre-computed weights. I'm not sure resizing the images will work well.

I was wondering how much effort it would take to train to the point where the transferred knowledge (weights similar to the ones downloaded from the TLT repository) is reasonable. Say I have a single 2080 GPU: is it a matter of days, weeks, months, or years?

Thanks for the help!

Hi dannykario,
For classification, according to the TLT documentation:
Input size: 3 * H * W (W, H >= 16)
Input format: JPG, JPEG, PNG
Note: Classification input images do not need to be manually resized. The input dataloader resizes images as needed.

I'm not sure of the exact effort required to train in your case, but I think it is on the order of days.
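
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: the throughput of 500 images/s at 224x224 is a made-up placeholder (measure your own on your GPU), the epoch count is a hypothetical choice, and the idea that throughput drops in proportion to the pixel count is a simplification that ignores data loading and memory limits. The function names are illustrative and not part of TLT.

```python
def scale_throughput(base_images_per_second, base_side, new_side):
    """Scale a measured throughput to a new square resolution.

    Assumes cost per image grows with the number of pixels, which is
    only a rough approximation for convolutional networks.
    """
    return base_images_per_second * (base_side ** 2) / float(new_side ** 2)


def estimate_training_hours(num_images, epochs, images_per_second):
    """Return wall-clock hours to process num_images for the given epochs."""
    total_images = num_images * epochs
    return total_images / float(images_per_second) / 3600.0


# Hypothetical inputs: only the ~200K image count comes from the question.
NUM_IMAGES = 200000
EPOCHS = 80                 # assumed training length
BASE_THROUGHPUT = 500.0     # assumed images/s at 224x224, not a measurement

for side in (224, 448):
    throughput = scale_throughput(BASE_THROUGHPUT, 224, side)
    hours = estimate_training_hours(NUM_IMAGES, EPOCHS, throughput)
    print("%dx%d: about %.1f hours (%.1f days)" % (side, side, hours, hours / 24.0))
```

Replace the placeholder throughput with a number measured over a few hundred training steps at your target resolution; the estimate is only as good as that measurement.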

Hi Morganh,
I want to train the classification model (ResNet18) with 2 classes.
Currently, as you know, ResNet18 outputs a classification over 20 classes.
How can I customize it? I just want to change the last layer of ResNet18 from 20 classes to 2 classes.

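I can change the last layer of ResNet18 with the following code:
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Load the pretrained ResNet18 template.
model = load_model("/workspace/tlt-experiments/pretrained_resnet18/tlt_resnet18_classification_v1/resnet18.hdf5")

# List the layers by index to find the one feeding the classifier head.
# The loop body was lost in the post; printing index and name is an assumption.
for i, layer in enumerate(model.layers):
    print(i, layer.name)

# Attach a new 2-class softmax head to the output of layer 77.
x = model.layers[77].output
x = Dense(2, activation='softmax')(x)
model = Model(inputs=model.input, outputs=x)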

I’ve changed classification_spec.cfg to the custom model:
model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_bias: true
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/train"
  val_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/val"
  pretrained_model_path: "/workspace/tlt-experiments/custom_classification/tlt_resnet18_classification_v1/custom_resnet18.hdf5"
  optimizer: "sgd"
  batch_size_per_gpu: 64
  n_epochs: 80
  n_workers: 16

  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }

  lr_config {
    scheduler: "step"
    learning_rate: 0.006
    #soft_start: 0.056
    #annealing_points: "0.3, 0.6, 0.8"
    #annealing_divider: 10
    step_size: 10
    gamma: 0.1
  }
}
eval_config {
  eval_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/test"
  model_path: "/workspace/tlt-experiments/custom_classification/output/weights/resnet_080.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
}

When I run tlt-train with that model by command:
!tlt-train classification -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

I got an error:

Using TensorFlow backend.
2019-11-07 07:26:58.310036: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-07 07:26:58.418619: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-07 07:26:58.419153: I tensorflow/compiler/xla/service/] XLA service 0x546f370 executing computations on platform CUDA. Devices:
2019-11-07 07:26:58.419190: I tensorflow/compiler/xla/service/] StreamExecutor device (0): GeForce GTX 1070 Ti, Compute Capability 6.1
2019-11-07 07:26:58.421264: I tensorflow/core/platform/profile_utils/] CPU Frequency: 3199975000 Hz
2019-11-07 07:26:58.421487: I tensorflow/compiler/xla/service/] XLA service 0x54d7200 executing computations on platform Host. Devices:
2019-11-07 07:26:58.421519: I tensorflow/compiler/xla/service/] StreamExecutor device (0): ,
2019-11-07 07:26:58.421657: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: GeForce GTX 1070 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.95GiB
2019-11-07 07:26:58.421687: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2019-11-07 07:26:58.422319: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-07 07:26:58.422335: I tensorflow/core/common_runtime/gpu/] 0
2019-11-07 07:26:58.422343: I tensorflow/core/common_runtime/gpu/] 0: N
2019-11-07 07:26:58.422421: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6764 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Found 1474 images belonging to 3 classes.
2019-11-07 07:26:58,531 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/tlt-experiments/custom_classification/custom_data/train
Found 210 images belonging to 3 classes.
2019-11-07 07:26:58,635 [INFO] iva.makenet.scripts.train: Processing dataset (validation): /workspace/tlt-experiments/custom_classification/custom_data/val
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-07 07:26:58,648 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in
  File "./common/", line 27, in main
  File "./makenet/scripts/", line 403, in main
  File "./makenet/scripts/", line 314, in run_experiment
  File "./makenet/utils/", line 48, in model_io
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/", line 55, in deserialize
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/", line 145, in deserialize_keras_object
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 1022, in from_config
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 1008, in process_layer
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/", line 55, in deserialize
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/", line 138, in deserialize_keras_object
    ': ' + class_name)
ValueError: Unknown layer: BatchNormalizationV1

How can I fix it?

Hi chuongvodoi95,
The number of classes depends on your dataset; it has nothing to do with ResNet18.
If you want to train on 2 classes, split your dataset into 2 classes and make sure each of the train/val/test folders contains both class folders.
For example,
$ ls train
cat dog
$ ls val
cat dog
$ ls test
cat dog

The cat folder contains many cat images.
The dog folder contains many dog images.
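
To double-check the layout before training, a small Python sketch like the one below can list the class folders in each split and flag a mismatch. It assumes the train/val/test folder names used above and the dataset root from the spec earlier in this thread; the helper names `class_folders` and `check_splits` are made up for illustration and are not part of TLT.

```python
import os


def class_folders(split_dir):
    """Return the sorted names of the class subfolders inside one split."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )


def check_splits(data_root, splits=("train", "val", "test")):
    """Map each split to its class folders and report whether they all agree."""
    classes = {
        split: class_folders(os.path.join(data_root, split))
        for split in splits
    }
    consistent = len(set(tuple(names) for names in classes.values())) == 1
    return classes, consistent


if __name__ == "__main__":
    # Dataset root taken from the spec in this thread; adjust to your setup.
    root = "/workspace/tlt-experiments/custom_classification/custom_data"
    if os.path.isdir(root):
        found, ok = check_splits(root)
        for split, names in found.items():
            print("%s: %d classes %s" % (split, len(names), names))
        print("splits consistent:", ok)
```

If a split reports more folders than expected (for example 3 when you intended 2), remove or merge the extra folder so that every split lists the same classes.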

Thank you so much Morganh