Tao unet "TrainingConfig" has no field named "activation"

• Hardware: RTX 3090
• Network Type: unet/resnet
• TLT Version:

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

• Training spec file (see below…)

• How to reproduce the issue?

Created a specs file with

training_config {
  batch_size: 6
  epochs: 200
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  activation: "sigmoid"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

But I get an error:

“TrainingConfig” has no field named “activation”.

Specs file here (1.4 KB)

Please set it inside model_config.
It seems there is a mismatch in the user guide.

I did that:

model_config {
  model_input_width: 1280
  model_input_height: 704
  model_input_channels: 1
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
  activation: "sigmoid"
}

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [6,1,1280,704] vs. [6,1,704,1280]
[[{{node logistic_loss/mul}}]]
[[Add_11/_2127]]
(1) Invalid argument: Incompatible shapes: [6,1,1280,704] vs. [6,1,704,1280]
[[{{node logistic_loss/mul}}]]

The complete spec file is here: unet_train_resnet_6S200.txt (1.4 KB)

And the complete training log with the error is here: bad training.txt (54.7 KB)

And the train command:


!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/unet_train_resnet_6S200.txt \
              -r $USER_EXPERIMENT_DIR/unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/resnet_18.hdf5  \
              -n model \
              -k $KEY

From your log, I can see the following:
model_input_height: 704
model_input_width: 1280

It is different from what you shared above.

Can you use a new result folder and retry?

In what way is it different? On the contrary, my spec file and images are all height 704 and width 1280!

Created a new folder (called "something") for results and ran

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/unet_train_resnet_6S.txt \
              -r $USER_EXPERIMENT_DIR/something \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/resnet_18.hdf5  \
              -n model \
              -k $KEY 

And this time it produced a similar but slightly different error…

(0) Invalid argument: Incompatible shapes: [3,1,1280,704] vs. [3,1,704,1280]
[[{{node logistic_loss/mul}}]]
[[total_loss_ref/_2107]]
(1) Invalid argument: Incompatible shapes: [3,1,1280,704] vs. [3,1,704,1280]
[[{{node logistic_loss/mul}}]]

This is all related to using activation: "sigmoid". If I remove that single line from the specs file, everything runs to completion (with the problems reported in my other two posts…).

unet_train_resnet_6S.txt (1.4 KB)
bad train 2022 11 06.txt (54.7 KB)

OK, my bad. It is 1280x704. There is no difference.

So, will the default notebook hit a similar error with activation: "sigmoid"?

What do you mean by default notebook? The isbi notebook? It took me a couple of hours to get it to run, and no, this doesn't happen on the isbi notebook. But looking deeper into that example, the images are 320 by 320, which would not trigger this issue. More wasted time…

Besides the major reduction in performance when exporting (MAJOR ACCURACY LOSS when EXPORTING tao unet model after retraining pruned model), I am having issues with the order of the buffer resulting from the TensorRT inference.

Let me explain:

I had trained a unet multi-class semantic segmentation model with color data of shape 704x1280…

All the processing completes, but the performance is very poor. Very very poor…

I have a C++ inference program that works, in the sense that it takes the live feed data, normalizes it, and pushes it into a CUDA buffer, unfolding the RGB channels and doing NHWC-to-NCHW conversion. The segmentation classification performance is poor, but the pixel placement is correct.
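
For illustration, here is a minimal, self-contained sketch of that normalization and NHWC-to-NCHW repacking. It is only a sketch, not the actual program: the divide-by-255 normalization, the buffer names, and the single-frame setup are placeholders.

// Sketch only: repack one interleaved HxWxC 8-bit RGB frame (NHWC) into a
// planar CxHxW float buffer (NCHW), scaling pixel values to [0, 1].
#include <cstdint>
#include <vector>

void nhwcToNchw(const uint8_t* src, float* dst, int height, int width, int channels)
{
    for (int c = 0; c < channels; ++c) {
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                // interleaved (NHWC) index: (y * width + x) * channels + c
                // planar (NCHW) index:      (c * height + y) * width + x
                dst[(c * height + y) * width + x] =
                    static_cast<float>(src[(y * width + x) * channels + c]) / 255.0f;
            }
        }
    }
}

int main()
{
    const int H = 704, W = 1280, C = 3;          // model_input_height x model_input_width x channels
    std::vector<uint8_t> frame(H * W * C, 0);    // stand-in for one live camera frame
    std::vector<float> input(C * H * W, 0.0f);   // host staging buffer, later copied to the CUDA input binding
    nhwcToNchw(frame.data(), input.data(), H, W, C);
    return 0;
}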

Now, I am trying to improve performance by training a binary model to detect the one critical part of the image. After completing a full cycle of training, in addition to the major performance drop when exporting, the pixel placement is off, as if the columns and rows were swapped behind the scenes.
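
A minimal sketch of how that suspicion can be checked, assuming the engine writes one class index per pixel into a flat buffer (the buffer name, dimensions, and test pixel are illustrative only): read the same logical pixel once with a height-by-width stride and once with a width-by-height stride, and see which interpretation lines up with the input frame.

// Sketch only: compare a normal H x W read-out of a flat per-pixel output
// buffer with a transposed W x H read-out at the same logical pixel.
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const int H = 704, W = 1280;
    std::vector<int32_t> out(static_cast<size_t>(H) * W, 0);  // stand-in for the per-pixel class output

    const int y = 100, x = 200;  // an example pixel location
    // Expected layout: row-major, height x width.
    std::printf("H x W read: %d\n", out[y * W + x]);
    // Transposed layout: row-major, width x height.
    std::printf("W x H read: %d\n", out[x * H + y]);
    return 0;
}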

My estimation is that NVIDIA UNet uses completely different code paths for grayscale/binary and for color/multi-class models. That's why I think there is a major drop in performance when exporting, sigmoid doesn't work with images that are not square, and the inference is wrong in pixel placement…

Yes, thanks for the time to narrow down.

Please refer to
Problems encountered in training unet and inference unet - #27 by Morganh.

I cannot reproduce the accuracy drop when exporting. Also, I was training a model of 960x544.

For sigmoid activation, I will check if I can reproduce later.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Update: I trained the fire dataset mentioned in Problems encountered in training unet and inference unet - #26 again, with activation: "sigmoid".
The training runs smoothly without error.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.