Hi,
I’m trying to fine-tune the TrafficCamNet object detector on a custom dataset. The problem is that, for some reason, the model appears to be training for only a single class. Here is the evaluation after the 10th epoch:
Epoch 10/10
=========================
Validation cost: 0.000024
Mean average_precision (in %): 14.6148
class name      average precision (in %)
------------    --------------------------
bike            0
car             43.8443
pedestrian      0
Is there something wrong with the training configuration file that I’m using? I’m attaching the file below. detectnet_v2_train_resnet18_kitti.txt (5.5 KB)
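As a quick sanity check on the numbers above, the reported mean average precision matches the plain average of the three per-class values, so the two classes at 0 are what pull the mean down. A minimal sketch, assuming the mean is an unweighted average over classes (the dictionary name is only illustrative):

```python
# Per-class average precision (in %) copied from the epoch 10 evaluation above.
per_class_ap = {"bike": 0.0, "car": 43.8443, "pedestrian": 0.0}

# Assumption: the mean AP is the unweighted average over all target classes.
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)

print(f"Mean average precision (in %): {mean_ap:.4f}")  # 14.6148
```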
Hey @Morganh, I think there might be a bug in how the specification file is parsed. I’m not sure, but these are my findings.
The resnet18.tlt is an unpruned TrafficCamNet TLT model.
I don’t think the target class mapping in the spec file works the way I expected, although my understanding of it could be wrong.
My guess was this: if the default model has two source classes, pedestrian and person sitting, both mapped to pedestrian, then both would be merged into a single class for training, because that is what the default specification file seems to do.
In my case I was planning to merge all 4-wheelers into a single class and map the other classes accordingly. The cost function entries for the target classes use the value fields of the key-value mappings above. But the merging did not happen. The only reason the car class was considered for precision during training (which was pure luck) is that its key and value matched, as in there was a key car with the value car. Any other mapping, such as bus → car or person → pedestrian, wasn’t being considered.
Could this be a bug, or am I mistaken?
@Morganh, I should have been clearer. What I was trying to do in the above config file is map bus, truck, and car (all vehicles with 4 wheels) to a single class, car. The config file attached is the latest one.
It is fine to merge all 4-wheelers into a single class.
Reference from the TLT user guide:
target_class_mapping : This parameter maps the class names in the tfrecords to the target class to be trained in the network. An element is defined for every source class to target class mapping. This field was included with the intention of grouping similar class objects under one umbrella. For eg: car, van, heavy_truck etc may be grouped under automobile. The “key” field is the value of the class name in the tfrecords file, and “value” field corresponds to the value that the network is expected to learn.
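The key-to-value grouping described in the user guide excerpt can be sketched in plain Python. This is only an illustration of the intended semantics, not TLT code: the dictionary contents, the label list, and the helper `map_labels` are hypothetical. It also shows one way a mapping can silently miss objects, namely when a key does not exactly match the class name stored in the tfrecords (here a differently capitalised `Bus`).

```python
from collections import Counter

# Hypothetical mapping: key = class name in the tfrecords,
# value = target class the network should learn.
target_class_mapping = {
    "car": "car",
    "bus": "car",
    "truck": "car",
    "person": "pedestrian",
    "bicycle": "bike",
}

# Hypothetical class names as they might appear in the dataset labels.
source_labels = ["car", "bus", "truck", "person", "Bus", "bicycle"]


def map_labels(labels, mapping):
    """Count objects per target class and collect labels with no matching key."""
    mapped = Counter()
    unmapped = Counter()
    for name in labels:
        if name in mapping:
            mapped[mapping[name]] += 1
        else:
            unmapped[name] += 1
    return mapped, unmapped


mapped, unmapped = map_labels(source_labels, target_class_mapping)
print(dict(mapped))    # objects grouped under each target class
print(dict(unmapped))  # labels that no key matched exactly
```

Running a check like this over the label files before converting them is a cheap way to confirm that every source class name has a matching key.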
Hey @Morganh, I think that is the issue I’m trying to point out. I feel there is some issue with the mapping in TLT. The classes don’t get mapped when I use it. Could you please check whether there is any issue in the cfg I attached earlier? Thanks.
@Morganh, what approaches could be taken when the validation accuracy isn’t improving or is stuck at a local minimum? My current accuracy is varying between 71 and 58 to 68 and 50 respectively for two classes. Is there anything I could do to improve it?
No issue in your spec’s target_class_mapping. It is ok.
Please try to fine-tune the hyper-parameters.
You can try a batch size of 16 and increase the number of epochs. It seems that 10 epochs is a bit short.
You can also modify the learning rate. More experiments are needed.
Hmm, yes, about increasing the batch size: I’m constrained by my machine right now. I’ve increased the epochs to 70, and from the 30th epoch onward I’ve noticed this behaviour, with the precision jumping around an optimum. Right now I can change the batch size from 4 to 8 but for some reason I can change it only during start when trying to stop and change batch size the train process seems to fail. Any idea on it?
I cannot understand the meaning of “but for some reason I can change it only during start when trying to stop and change batch size the train process seems to fail.”
Can you share the failed log? Thanks.
@Morganh, here is the log I get when I start training with, say, batch size 4, then stop the training process and restart it with an increased batch size of 8.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: assertion failed: [2.66666675]
     [[{{node Assert/AssertGuard/Assert}}]]
     [[resnet18_nopool_bn_detectnet_v2/block_4b_bn_2/AssignMovingAvg/_4217]]
  (1) Invalid argument: assertion failed: [2.66666675]
     [[{{node Assert/AssertGuard/Assert}}]]
0 successful operations.
0 derived errors ignored.
The moment I revert the batch size, the training continues from where it left off.
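One possible explanation, which is an assumption and not something confirmed in this thread, is that the checkpoint stores a global step count and the training schedule asserts that this step does not exceed the total number of steps. Doubling the batch size halves the steps per epoch, so a step saved under batch size 4 can land beyond the end of the schedule computed for batch size 8. The sketch below only illustrates that arithmetic; the dataset size, the epoch at which training was stopped, and the helper `total_steps` are hypothetical, and it does not attempt to reproduce the exact value in the log.

```python
import math


def total_steps(num_images, batch_size, num_epochs):
    """Total optimizer steps for a run, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * num_epochs


num_images = 8000   # hypothetical dataset size
num_epochs = 70     # epoch count mentioned earlier in the thread
stopped_epoch = 40  # hypothetical epoch at which training was stopped

# Global step stored in the checkpoint after training with batch size 4.
saved_step = stopped_epoch * math.ceil(num_images / 4)

# Fraction of the schedule completed, as seen by each configuration.
progress_bs4 = saved_step / total_steps(num_images, 4, num_epochs)
progress_bs8 = saved_step / total_steps(num_images, 8, num_epochs)

print(f"batch size 4: progress = {progress_bs4:.3f}")  # below 1, resume is consistent
print(f"batch size 8: progress = {progress_bs8:.3f}")  # above 1, past the end of the schedule
```

If this is indeed the cause, it would also be consistent with the observation that reverting to the original batch size lets training resume.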