Are the test images much different from the training images?
Also, you can run evaluation against the training images to narrow this down.
Sent you a private message with examples of both …
It does not make sense, since the results during training are different.
@Morganh I agree it doesn't make sense.
This is the test spec:
test_isbi.yaml (836 Bytes)
And full tao evaluate log:
evaluate log.txt (7.3 KB)
@Morganh The problem is with the last line of section 6 of the segformer notebook:
print('Rename a model: Note that the training is not deterministic, so you may change the model name accordingly.')
print('---------------------')
# NOTE: The following command may require `sudo`. You can run the command outside the notebook.
!find $HOST_RESULTS_DIR/isbi_experiment/ -name "iter_1000.tlt" | xargs realpath | xargs -I {} mv {} $HOST_RESULTS_DIR/isbi_experiment/isbi_model.tlt
!ls -ltrh $HOST_RESULTS_DIR/isbi_experiment/isbi_model.tlt
-name "iter_1000.tlt"
I did 20,000 iterations, but this cell kept copying the checkpoint from iteration 1000 as the final model.
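If that is the cause, the fix is just to rename the checkpoint from the final iteration instead. A minimal sketch (not from the notebook), assuming checkpoints are written as iter_<N>.tlt; 20000 is a placeholder for your final iteration:
# Hypothetical fix: rename the checkpoint from the final iteration (20000 is a placeholder).
!find $HOST_RESULTS_DIR/isbi_experiment/ -name "iter_20000.tlt" | xargs realpath | xargs -I {} mv {} $HOST_RESULTS_DIR/isbi_experiment/isbi_model.tlt
# Or pick whichever iter_*.tlt has the highest iteration number (GNU sort -V sorts the numbers naturally).
!find $HOST_RESULTS_DIR/isbi_experiment/ -name "iter_*.tlt" | sort -V | tail -1 | xargs -I {} mv {} $HOST_RESULTS_DIR/isbi_experiment/isbi_model.tlt
!ls -ltrh $HOST_RESULTS_DIR/isbi_experiment/isbi_model.tlt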
More results coming…
@Morganh After 20,000 iterations, tao evaluate on the val dataset gives:
+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| foreground | 15.65 | 18.77 |
| background | 99.92 | 99.98 |
+------------+-------+-------+
After 88,400 iterations and about 20 hours of training, tao evaluate on the val dataset gives:
+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| foreground | 29.22 | 40.79 |
| background | 99.92 | 99.97 |
+------------+-------+-------+
However, running tao inference on the test dataset gives an error:
raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: dataset_config.test_ann_dir
full_key: dataset_config.test_ann_dir
reference_type=SFDatasetConfig
object_type=SFDatasetConfig
Why is the inference looking for an annotation (masks) directory?
But the more important question is: what should I do to achieve better performance? Brute-force more training time? Which hyperparameters should I tweak?
I see the learning rate (lr) decreases over time … When restarting training from a checkpoint, should I start with the same lr it had at the checkpoint?
@Morganh Thanks for the help in all this!
Dave
Yes, that looks like an issue. You can use a dummy annotation directory or the val annotations to work around it.
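For example, in test_isbi.yaml you could point test_ann_dir at the val masks; the key name comes from the error's full_key, and the path below is just a placeholder:
dataset_config:
  # Placeholder path: any existing mask folder (e.g. the val masks) works here;
  # it is only needed to satisfy the mandatory field for inference.
  test_ann_dir: /data/isbi/masks/val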
Can you share all of the training logs and the resumed-training logs?
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
Please try to add more images for SegFormer training.
Also, please set a lower range for the resize augmentation, as below:
resize:
  img_scale:
    - 704
    - 1280
  ratio_range:
    - 0.8
    - 1.2