Please pull the TAO 5.0 docker image (nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5) and retry, because it is the latest version for UNet. You can also find the source code inside the docker container, which helps with debugging.
The source code is in https://github.com/NVIDIA/tao_tensorflow1_backend/blob/2ec95cbbe0d74d6a180ea6e989f64d2d97d97712/nvidia_tao_tf1/cv/unet/utils/model_fn.py#L97-L101.
It seems that you set a large value for decay_steps.
You can check the effect of decay_steps: 650800 with the code below.
An example:
import matplotlib.pyplot as plt
import tensorflow as tf  # TF 1.x API (tf.Session, tf.train.cosine_decay)

y = []
z = []
EPOCH = 200

with tf.Session() as sess:
    for step in range(EPOCH):
        # Same initial learning rate, two different decay_steps values.
        learning_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=step, decay_steps=50)
        learning_rate2 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=step, decay_steps=100)
        # Evaluate each schedule at the current step.
        y.append(sess.run(learning_rate1))
        z.append(sess.run(learning_rate2))

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=50', 'decay_steps=100'], loc='upper right')
plt.show()
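Result of the above example: both curves start at 0.1 and follow a half cosine down to 0, reaching it at step 50 and step 100 respectively, and stay at 0 afterwards.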
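To see why decay_steps: 650800 counts as large, the cosine schedule can also be evaluated without TensorFlow. The following is a minimal sketch, assuming the formula documented for tf.train.cosine_decay with its default alpha=0. The helper name cosine_decay_lr, the initial learning rate of 0.1 and the sampled steps are illustrative only and are not taken from your spec file.

```python
import math


def cosine_decay_lr(initial_lr, step, decay_steps, alpha=0.0):
    """Plain-Python version of the documented tf.train.cosine_decay formula."""
    # The schedule is clamped once decay_steps has been reached.
    step = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


INITIAL_LR = 0.1      # illustrative value (assumption), same as the example above
DECAY_STEPS = 650800  # value mentioned in this thread

# Sample the schedule at a few steps to see how slowly it moves.
for step in (0, 18000, 100000, 325400, 650800):
    lr = cosine_decay_lr(INITIAL_LR, step, DECAY_STEPS)
    print("step=%d lr=%.6f" % (step, lr))
```

Under these assumptions the learning rate is still above 0.099 at step 18000, so the decay is barely visible over a short run. It only drops to half of the initial value at step 325400.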
BTW, there is another issue in TAO 5.0 as well. You need to change the code as described in "UNet training progress counter frozen after ~18.000 steps - #19 by Morganh".
