Troubles Replicating TLT Model Training Experiment with TAO


I had to step away from this project for a while, and my topic got “resolved” in the meantime, so I am reopening it here.

What is the difference between TLT and TAO tfrecord generation?
I looked at the code that Morganh shared in response to this question, but it is unclear what it was meant to show me. The code had few comments and little explanation of how the tfrecords are generated. I assume the “LEGACY” dataloader is the dataloader for TLT. I tried this dataloader with both TAO and TLT records, and both runs failed with this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1067, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1046, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
    return_args = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1024, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 887, in run_experiment
    train_gridbox(results_dir, experiment_spec, output_model_file_name, input_model_file_name,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 731, in train_gridbox
    evaluator = build_validation_graph(experiment_spec,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 593, in build_validation_graph
    assert num_validation_samples > 0,\
AssertionError: Validation period is not 0, but no validation data found. Either turn off validation by setting `validation_period = 0` or specify correct path/fold for validation data.
Execution status: FAIL

Another user messaged me directly, and they have the exact same problem that I am having. The model performs at ~90% on TLT, but we can't get the performance over .50% on TAO. They were able to get good results on both TLT and TAO using an example NVIDIA DetectNet notebook with images that are 1248x384, as Morganh suggested. This is great, but the images for our application are 960x544. Why would changing the image size affect the model training performance?

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Please share the training spec file. The error above indicates that you need to either set validation_period = 0 or specify a validation_data_source. See DetectNet_v2 - NVIDIA Docs for more information.
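
For anyone hitting the same assertion, here is a minimal sketch of a pre-flight check on the spec file. It only scans the spec text for the two validation-related fields that the DetectNet_v2 docs describe under dataset_config (validation_fold and validation_data_source). The helper name, the example spec, and its placeholder paths are hypothetical and not part of TAO. It does not verify that the referenced tfrecords actually contain samples.

```python
import re


def validation_settings(spec_text):
    """Report which validation-related fields appear in a spec file's text."""
    # Drop '#' comments so that commented-out fields are not counted.
    text = re.sub(r"#.*", "", spec_text)
    return {
        "validation_fold": re.search(
            r"\bvalidation_fold\s*:\s*\d+", text) is not None,
        "validation_data_source": re.search(
            r"\bvalidation_data_source\s*:?\s*\{", text) is not None,
    }


# Hypothetical spec fragment with placeholder paths (assumption, not a real spec).
example_spec = """
dataset_config {
  data_sources {
    tfrecords_path: "/path/to/tfrecords/*"
    image_directory_path: "/path/to/images"
  }
  # validation_fold: 0
}
"""

found = validation_settings(example_spec)
if not any(found.values()):
    print("No validation_fold or validation_data_source in the spec; "
          "training with validation enabled will likely hit the assertion.")
```

If neither field is present, the options named in the error message apply: turn validation off or point the spec at validation data.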

Please set enable_auto_resize: true in the TAO training spec and retry. See DetectNet_v2 - NVIDIA Docs.
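
To make the suggestion concrete, here is a rough sketch that reads the preprocessing settings out of a spec's text and reports whether the dataset would need an offline resize. It assumes, as I understand the DetectNet_v2 docs, that enable_auto_resize sits next to output_image_width and output_image_height in the preprocessing block of augmentation_config, and that without it the images must already match the output size. Please verify both points against the docs. The helper names and the example fragment are hypothetical.

```python
import re


def preprocessing_summary(spec_text):
    """Extract output size and the auto-resize flag from a spec file's text."""
    # Drop '#' comments so that commented-out fields are not counted.
    text = re.sub(r"#.*", "", spec_text)

    def int_field(name):
        match = re.search(rf"\b{name}\s*:\s*(\d+)", text)
        return int(match.group(1)) if match else None

    flag = re.search(r"\benable_auto_resize\s*:\s*(\w+)", text)
    return {
        "width": int_field("output_image_width"),
        "height": int_field("output_image_height"),
        "auto_resize": bool(flag) and flag.group(1).lower() == "true",
    }


def needs_offline_resize(summary, image_size):
    """True if images differ from the output size and auto-resize is off."""
    if summary["auto_resize"]:
        return False
    return image_size != (summary["width"], summary["height"])


# Hypothetical fragment using the 960x544 size from this thread (assumption).
example_spec = """
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    enable_auto_resize: true  # the flag suggested above
  }
}
"""

summary = preprocessing_summary(example_spec)
print(summary)
print(needs_offline_resize(summary, (1248, 384)))
```

This only inspects the spec text. It says nothing about whether the labels in the tfrecords match the resized images.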
