Epoch and checkpoint number association formula for Detectnet_V2 not lining up

I was looking over a previous post that details the process of getting the epoch associated with a checkpoint, at the following forum link: [Get epoch associated with a checkpoint in Detectnet_V2]

However, when I check it against a Detectnet_V2 model that was just trained using TAO, the numbers don’t work out at all, and I’m wondering whether that is simply due to how the model was set up and trained versus what the formula expects.

For context, here are the model setup and the input values to the formula:

  • The saved checkpoint is model.step-248640 and the associated epoch is 280
  • The model was trained on 16 GPUs and the batch size per GPU (in the training configuration file) is 16
  • The train kitti config and validation kitti config were both generated with “random” partition mode (so 2 folds) and a val_split value of 1, as well as a num_shards value of 8. During training, the dataloader section of the configuration file is provided with both the training and validation data sources. In total, there are 149677 images in the training folder as well as 6437 images in the validation folder.

As the val_split and the actual validation set used in the training have different values, I’m a bit confused as to which value would be applicable, if any: whether the validation split should be 1% = 0.01 or 6437/(6437+149677) ≈ 0.0412. Regardless of which one is used, the formula still doesn’t output the expected values.
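For what it’s worth, the two candidate splits I’m weighing can be written out as follows (a quick sketch, using the image counts listed above):

```python
# Two candidate interpretations of the validation split (counts from above):
val_split_percent = 1 / 100                # val_split value of 1, read as 1%
val_split_actual = 6437 / (6437 + 149677)  # actual validation fraction in the folders

print(val_split_percent)           # 0.01
print(round(val_split_actual, 4))  # 0.0412
```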

Any thoughts would be appreciated!

Can you share the log from when you generated the tfrecord files?
Please also share its spec file and the training spec.

You can find similar info in training log.
For example,

2022-06-09 15:28:42,206 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 6434, number of sources: 1, batch size per gpu: 4, steps: 1609

The checkpoint numbers will then be multiples of 1609.
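The arithmetic behind that log line can be sketched as follows (values taken from the example line above):

```python
import math

# Values from the example data_loader log line above
total_dataset_size = 6434
batch_size_per_gpu = 4

# steps per epoch = ceil(dataset size / batch size per gpu)
steps_per_epoch = math.ceil(total_dataset_size / batch_size_per_gpu)
print(steps_per_epoch)  # 1609
```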

I looked within my training log and I unfortunately can’t find the line you mentioned above. Instead, whenever a checkpoint is saved, I get the following:

INFO:tensorflow:Saving checkpoints for step-8880.
2022-06-03 19:40:49,274 [INFO] tensorflow: Saving checkpoints for step-8880.

Otherwise, the sample output per line is something like the following:
2022-06-03 19:40:43,871 [INFO] tensorflow: epoch = 29.966216216216218, loss = 0.0011594478, step = 8870 (6.028 sec)
INFO:tensorflow:global_step/sec: 1.65613
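If it helps, here is how I read those two snippets together. Assuming checkpoints are saved once per epoch, the checkpoint at step 8880 would mark the end of epoch 30, implying 296 steps per epoch for that run; this is my inference, not something the log states directly:

```python
# Inferred, not stated in the log: if step 8880 ends epoch 30,
# then this run had 8880 / 30 = 296 steps per epoch.
steps_per_epoch = 8880 // 30  # 296

# Fractional epoch reached at a given global step
epoch_at_step = 8870 / steps_per_epoch
print(epoch_at_step)  # ~29.9662, matching the epoch value in the log line
```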

I unfortunately don’t know whether the log from when the tfrecord files were generated is still available, but here are the spec files for the training set of tfrecords and the training spec. Note that I’ve replaced some of the actual paths with <placeholders_for_paths>.

training_detectnet_v2_spec_config (4.0 KB)
train_dataset_spec_config (290 Bytes)

My log is just an example. Can you share your full log?

Here is the full log with paths and users substituted:

logfile.tar.gz (10.0 MB)

From your log, there is the following line for the training dataset.

[1,14]:2022-06-04 18:08:42,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 227167, number of sources: 1, batch size per gpu: 16, steps: 888

The steps per epoch are 888, so the checkpoint numbers are multiples of that.

I guess the question still remains that I’m unclear how exactly the checkpoint number relates to the epochs. I understand that the checkpoint number and saved model name are related to the steps, but that doesn’t resolve the main question: the values I have fundamentally do not work out with the reference formula.

You have 16 shards, according to the line below.

[1,15]<stderr>:2022-06-04 18:08:44,020 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 15 of 16

Then,
The sharded_size is the ceiling of (training_images / shard_count), i.e., ceil(227167 / 16) = 14198.

The steps per epoch are ceil(sharded_size / batch_size_per_gpu), i.e., ceil(14198 / 16) = 888.
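In code form, the same two-step calculation (with the values from your log) looks like this:

```python
import math

# Values from the training log above
training_images = 227167
shard_count = 16
batch_size_per_gpu = 16

sharded_size = math.ceil(training_images / shard_count)         # 14198
steps_per_epoch = math.ceil(sharded_size / batch_size_per_gpu)  # 888
print(sharded_size, steps_per_epoch)
```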

Thank you for the clarification, and apologies if I’m still just missing it, but I don’t quite see how this relates the epochs to the steps/checkpoint number, as per my original question. The reference formula in question is

(formula image from the linked post, not reproduced here)

which was obtained from the link in my initial post. Am I to take it that there is no relation between epochs and steps?

First, please look in your log for a line similar to the one below.

2022-06-25 09:45:41,519 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1

So in that example log, the shard_count is 1.

If we take the shards into account and ignore val_split, the formula can be modified as below.

checkpoint_number = epoch_number * ceil (ceil((training_images ) / shard_count )/ batch_size_per_gpu )

For example, for your case (shard_count = 16, batch_size_per_gpu = 16):
checkpoint_number_1st_epoch = 1 * ceil(ceil(227167 / 16) / 16) = 888
checkpoint_number_2nd_epoch = 2 * ceil(ceil(227167 / 16) / 16) = 1776
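As a sanity check, this modified formula can be tied back to the checkpoint in the original post (model.step-248640 saved at epoch 280); a small sketch:

```python
import math

def checkpoint_number(epoch, training_images, shard_count, batch_size_per_gpu):
    # checkpoint_number = epoch * ceil(ceil(training_images / shard_count) / batch_size_per_gpu)
    steps_per_epoch = math.ceil(
        math.ceil(training_images / shard_count) / batch_size_per_gpu
    )
    return epoch * steps_per_epoch

print(checkpoint_number(1, 227167, 16, 16))    # 888
print(checkpoint_number(280, 227167, 16, 16))  # 248640, matching model.step-248640 at epoch 280
```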