Training Classification model from scratch

dannykario · October 26, 2019, 7:03am

Hi,

I have ~200K images for training a classification model with ~10 classes.

I need to work at a different (larger) resolution than the 224x224 used by the model with pre-computed weights. I'm not sure resizing the images will work well.

I was wondering how much effort it would take to train from scratch up to the level where the transferred knowledge (weights similar to the ones downloaded from the TLT repository) becomes reasonable. Say I have a single 2080 GPU: is it a matter of days, weeks, months or years?

Thanks for the help !

Morganh · October 26, 2019, 8:29am

Hi dannykario,
For classification, according to the TLT doc:
Input size: 3 * H * W (W, H >= 16)
Input format: JPG, JPEG, PNG
Note: Classification input images do not need to be manually resized. The input dataloader resizes images as needed.

I'm not sure of the exact effort your training will take, but I think it is a matter of days.
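
As a rough way to sanity-check an estimate like this, here is a back-of-the-envelope sketch. Everything numeric in it is a placeholder: the throughput and epoch count are made-up values, not measurements of any GPU, and the function names are hypothetical. It also assumes the cost per image grows roughly with the number of pixels, which is only an approximation for convolutional networks. Measure images per second on your own card for a few hundred iterations and plug that in.

```python
def scale_throughput(base_images_per_sec, base_hw, new_hw):
    """Rescale a measured throughput to another input resolution.

    Assumes cost per image grows linearly with the pixel count,
    which is a rough approximation, not a guarantee.
    """
    base_pixels = base_hw[0] * base_hw[1]
    new_pixels = new_hw[0] * new_hw[1]
    return base_images_per_sec * base_pixels / float(new_pixels)


def estimate_training_days(num_images, epochs, images_per_sec):
    """Wall-clock days for `epochs` passes over `num_images` images."""
    total_seconds = num_images * epochs / float(images_per_sec)
    return total_seconds / 86400.0


if __name__ == "__main__":
    # HYPOTHETICAL numbers: replace with your own measurements.
    assumed_224 = 300.0  # images/sec at 224x224 (placeholder)
    assumed_448 = scale_throughput(assumed_224, (224, 224), (448, 448))
    days = estimate_training_days(200000, 80, assumed_448)
    print("approx. %.2f days" % days)
```

Data loading and validation passes are ignored here, so treat the result as a lower bound.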

chuongvodoi95 · November 7, 2019, 6:17am

Hi Morganh,
I want to train the classification model (ResNet18) with 2 classes.
Currently, as you know, ResNet18 has a classification output for 20 classes.
How can I customize it? I just want to change the last layer of ResNet18 from 20 classes to 2 classes.

chuongvodoi95 · November 7, 2019, 7:50am

I can change the last layer of ResNet18 with the following code:

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model, load_model

# Load the pretrained ResNet18 downloaded from NGC.
model = load_model("/workspace/tlt-experiments/pretrained_resnet18/tlt_resnet18_classification_v1/resnet18.hdf5")

# Print every layer index and name to find the layer that feeds the classifier.
for i, layer in enumerate(model.layers):
    print(i, layer.name)

# Attach a new 2-class softmax head to the output of layer 77.
x = model.layers[77].output
x = Dense(2, activation="softmax")(x)
model = Model(inputs=model.input, outputs=x)
model.summary()
model.save("tlt_resnet18_classification_v1/custom_resnet18.hdf5")

I’ve changed classification_spec.cfg to the custom model:
model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_bias: true
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/train"
  val_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/val"
  pretrained_model_path: "/workspace/tlt-experiments/custom_classification/tlt_resnet18_classification_v1/custom_resnet18.hdf5"
  optimizer: "sgd"
  batch_size_per_gpu: 64
  n_epochs: 80
  n_workers: 16

  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }

  # learning_rate
  lr_config {
    scheduler: "step"
    learning_rate: 0.006
    #soft_start: 0.056
    #annealing_points: "0.3, 0.6, 0.8"
    #annealing_divider: 10
    step_size: 10
    gamma: 0.1
  }
}
eval_config {
  eval_dataset_path: "/workspace/tlt-experiments/custom_classification/custom_data/test"
  model_path: "/workspace/tlt-experiments/custom_classification/output/weights/resnet_080.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
}

When I run tlt-train with that model using this command:
!tlt-train classification -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

I got an error:

Using TensorFlow backend.
2019-11-07 07:26:58.310036: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-07 07:26:58.418619: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-07 07:26:58.419153: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x546f370 executing computations on platform CUDA. Devices:
2019-11-07 07:26:58.419190: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1070 Ti, Compute Capability 6.1
2019-11-07 07:26:58.421264: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3199975000 Hz
2019-11-07 07:26:58.421487: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x54d7200 executing computations on platform Host. Devices:
2019-11-07 07:26:58.421519: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-11-07 07:26:58.421657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1070 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.95GiB
2019-11-07 07:26:58.421687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-07 07:26:58.422319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-07 07:26:58.422335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-11-07 07:26:58.422343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-11-07 07:26:58.422421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6764 MB memory) → physical GPU (device: 0, name: GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Found 1474 images belonging to 3 classes.
2019-11-07 07:26:58,531 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/tlt-experiments/custom_classification/custom_data/train
Found 210 images belonging to 3 classes.
2019-11-07 07:26:58,635 [INFO] iva.makenet.scripts.train: Processing dataset (validation): /workspace/tlt-experiments/custom_classification/custom_data/val
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-07 07:26:58,648 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 27, in main
  File "./makenet/scripts/train.py", line 403, in main
  File "./makenet/scripts/train.py", line 314, in run_experiment
  File "./makenet/utils/helper.py", line 48, in model_io
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1022, in from_config
    process_layer(layer_data)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1008, in process_layer
    custom_objects=custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/generic_utils.py", line 138, in deserialize_keras_object
    ': ' + class_name)
ValueError: Unknown layer: BatchNormalizationV1
How can I fix it?
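
One way to narrow this down (a diagnostic sketch, not a confirmed fix): the traceback goes through the standalone keras package, while the file was saved with tensorflow.keras, so the layer class names written into the file may not all be ones the loader recognises. Keras stores the architecture as a JSON string in the `model_config` attribute of the HDF5 file; how you read that attribute (for example with h5py) is left out here. The function names, the `known` set and the demo JSON below are all hypothetical, and the parsing assumes a functional `Model` config layout.

```python
import json


def layer_class_names(model_config_json):
    """Return the sorted, de-duplicated layer class names in a Keras config."""
    config = json.loads(model_config_json)
    layers = config["config"]["layers"]  # functional-Model layout assumed
    return sorted({layer["class_name"] for layer in layers})


def unknown_layers(model_config_json, known):
    """List class names present in the config but absent from `known`."""
    return [n for n in layer_class_names(model_config_json) if n not in known]


if __name__ == "__main__":
    # Hand-written stand-in for a real model_config string.
    demo = json.dumps({"class_name": "Model", "config": {"layers": [
        {"class_name": "InputLayer"},
        {"class_name": "Conv2D"},
        {"class_name": "BatchNormalizationV1"},
        {"class_name": "Dense"},
    ]}})
    known = {"InputLayer", "Conv2D", "BatchNormalization", "Dense"}
    print(unknown_layers(demo, known))  # ['BatchNormalizationV1']
```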

Morganh · November 7, 2019, 8:08am

Hi chuongvodoi95,
The number of classes depends on your dataset. It has nothing to do with ResNet18.
If you want to train 2 classes, you need to split your dataset into 2 classes and make sure each of the train/val/test folders has the same two class subfolders.
For example,
$ ls train
cat dog
$ ls val
cat dog
$ ls test
cat dog

The cat folder contains lots of cat images.
The dog folder contains lots of dog images.
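
A small check along these lines can confirm the layout before training. It is only a sketch: the function names are hypothetical and it assumes every subfolder of a split is meant to be a class. Note that the training log earlier in the thread reports "Found 1474 images belonging to 3 classes", so an extra subfolder is the kind of thing this would surface.

```python
import os


def class_folders(split_dir):
    """Return the sorted names of the class subfolders in one split."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )


def check_splits(root, splits=("train", "val", "test")):
    """Return the shared class list, or raise if the splits disagree."""
    found = {s: class_folders(os.path.join(root, s)) for s in splits}
    reference = found[splits[0]]
    for split in splits:
        if found[split] != reference:
            raise ValueError("class folders differ between splits: %r" % (found,))
    return reference


# Example call (the path is a placeholder for your own dataset root):
# print(check_splits("/path/to/custom_data"))
```

For the layout above, the returned list would hold just the two folder names, cat and dog.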

chuongvodoi95 · November 7, 2019, 8:41am

Thank you so much Morganh
