I ran TAO training on both a p3.2xlarge and a p3.16xlarge instance.
However, I did not see much difference in training speed between the two.
The p3.16xlarge has more GPUs (8 vs. 1), but is the training not able to use them to the fullest?
If so, how can I maximize GPU utilization?
■ TAO training log for p3.2xlarge
command
!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
-n resnet18_detector \
--gpus 1
2022-06-09 08:02:01,930 [INFO] tensorflow: epoch = 84.7493654822335, learning_rate = 0.00045429557, loss = 6.0011807e-05, step = 133565 (5.151 sec)
2022-06-09 08:02:03,130 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.011
2022-06-09 08:02:06,036 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 34.413
INFO:tensorflow:epoch = 84.78172588832487, learning_rate = 0.00045241896, loss = 0.00011500824, step = 133616 (5.094 sec)
2022-06-09 08:02:07,024 [INFO] tensorflow: epoch = 84.78172588832487, learning_rate = 0.00045241896, loss = 0.00011500824, step = 133616 (5.094 sec)
INFO:tensorflow:global_step/sec: 11.311
2022-06-09 08:02:07,025 [INFO] tensorflow: global_step/sec: 11.311
2022-06-09 08:02:08,087 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.760
2022-06-09 08:02:10,154 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.379
INFO:tensorflow:epoch = 84.82106598984771, learning_rate = 0.0004501476, loss = 0.00012022845, step = 133678 (5.109 sec)
2022-06-09 08:02:12,133 [INFO] tensorflow: epoch = 84.82106598984771, learning_rate = 0.0004501476, loss = 0.00012022845, step = 133678 (5.109 sec)
2022-06-09 08:02:12,217 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.487
2022-06-09 08:02:14,307 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.849
2022-06-09 08:02:16,387 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.090
INFO:tensorflow:epoch = 84.86040609137055, learning_rate = 0.00044788793, loss = 0.00011997479, step = 133740 (5.148 sec)
2022-06-09 08:02:17,281 [INFO] tensorflow: epoch = 84.86040609137055, learning_rate = 0.00044788793, loss = 0.00011997479, step = 133740 (5.148 sec)
2022-06-09 08:02:18,467 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.091
INFO:tensorflow:global_step/sec: 12.0438
2022-06-09 08:02:20,061 [INFO] tensorflow: global_step/sec: 12.0438
2022-06-09 08:02:20,573 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.493
INFO:tensorflow:epoch = 84.8991116751269, learning_rate = 0.00044567612, loss = 9.090778e-05, step = 133801 (5.084 sec)
2022-06-09 08:02:22,364 [INFO] tensorflow: epoch = 84.8991116751269, learning_rate = 0.00044567612, loss = 9.090778e-05, step = 133801 (5.084 sec)
2022-06-09 08:02:22,609 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 49.131
2022-06-09 08:02:24,720 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.375
2022-06-09 08:02:26,777 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.609
INFO:tensorflow:epoch = 84.93845177664974, learning_rate = 0.00044343888, loss = 6.7681714e-05, step = 133863 (5.146 sec)
2022-06-09 08:02:27,510 [INFO] tensorflow: epoch = 84.93845177664974, learning_rate = 0.00044343888, loss = 6.7681714e-05, step = 133863 (5.146 sec)
2022-06-09 08:02:28,854 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.171
2022-06-09 08:02:30,966 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.339
INFO:tensorflow:epoch = 84.97779187817258, learning_rate = 0.00044121308, loss = 5.957734e-05, step = 133925 (5.140 sec)
2022-06-09 08:02:32,650 [INFO] tensorflow: epoch = 84.97779187817258, learning_rate = 0.00044121308, loss = 5.957734e-05, step = 133925 (5.140 sec)
2022-06-09 08:02:32,981 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 49.645
INFO:tensorflow:global_step/sec: 12.0718
2022-06-09 08:02:33,066 [INFO] tensorflow: global_step/sec: 12.0718
2022-06-09 08:02:35,074 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.787
2022-06-09 08:02:35,561 [INFO] root: None
2022-06-09 08:02:35,561 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 85/120: loss: 0.00005 learning rate: 0.00044 Time taken: 0:02:11.857269 ETA: 1:16:55.004400
2022-06-09 08:02:37,143 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.331
■ TAO training log for p3.16xlarge
command
!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
-n resnet18_detector \
--gpus 8
INFO:tensorflow:global_step/sec: 9.69944
2022-06-09 08:06:39,635 [INFO] tensorflow: global_step/sec: 9.69944
2022-06-09 08:06:39,942 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 310.953
INFO:tensorflow:global_step/sec: 9.69271
2022-06-09 08:06:41,595 [INFO] tensorflow: global_step/sec: 9.69271
2022-06-09 08:06:42,542 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 307.691
INFO:tensorflow:global_step/sec: 9.44691
2022-06-09 08:06:43,607 [INFO] tensorflow: global_step/sec: 9.44691
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007937764, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007937764, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00085954077, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00085954077, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00067783205, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00067783205, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006431361, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006431361, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0008180385, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0008180385, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006592367, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006592367, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007017506, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007017506, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00066348375, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00066348375, step = 2119 (5.123 sec)
2022-06-09 08:06:45,178 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 303.531
INFO:tensorflow:global_step/sec: 9.57317
2022-06-09 08:06:45,591 [INFO] tensorflow: global_step/sec: 9.57317
INFO:tensorflow:global_step/sec: 9.58015
2022-06-09 08:06:47,574 [INFO] tensorflow: global_step/sec: 9.58015
2022-06-09 08:06:47,776 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 308.004
INFO:tensorflow:global_step/sec: 9.67228
2022-06-09 08:06:49,539 [INFO] tensorflow: global_step/sec: 9.67228
2022-06-09 08:06:49,652 [INFO] root: None
2022-06-09 08:06:49,652 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 11/120: loss: 0.00082 learning rate: 0.00034 Time taken: 0:00:20.480506 ETA: 0:37:12.375148
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00067561684, step = 2169 (5.205 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00067561684, step = 2169 (5.205 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007094919, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007094919, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0006856213, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0006856213, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00069529866, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00069529866, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.000748715, step = 2169 (5.206 sec)
2022-06-09 08:06:49,859 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.000748715, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00065234513, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00065234513, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007180649, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00072094, step = 2169 (5.206 sec)
2022-06-09 08:06:49,859 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007180649, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00072094, step = 2169 (5.206 sec)
2022-06-09 08:06:50,390 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 306.048
INFO:tensorflow:global_step/sec: 9.4631
2022-06-09 08:06:51,547 [INFO] tensorflow: global_step/sec: 9.4631
2022-06-09 08:06:53,004 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 306.110
INFO:tensorflow:global_step/sec: 9.55692
2022-06-09 08:06:53,535 [INFO] tensorflow: global_step/sec: 9.55692
INFO:tensorflow:epoch = 11.258883248730964, learning_rate = 0.00037622757, loss = 0.00079730956, step = 2218 (5.155 sec)
INFO:tensorflow:epoch = 11.258883248730964, learning_rate = 0.00037622757, loss = 0.0007509783, step = 2218 (5.155 sec)
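
To put a number on the gap between the two runs, the `Train Samples / sec` lines from the logs above can be averaged and compared. This is only a minimal sketch: the helper names (`mean_samples_per_sec`, `scaling_summary`) are hypothetical and not part of TAO, and it assumes that `Train Samples / sec` reports throughput summed over all GPUs, which the logs themselves do not confirm. Note that `global_step/sec` alone may understate the difference if each step with `--gpus 8` processes one batch per GPU.

```python
import re
from statistics import mean

# Matches the throughput lines written by modulus.hooks.sample_counter_hook.
SAMPLES_RE = re.compile(r"Train Samples / sec: ([0-9.]+)")


def mean_samples_per_sec(log_text: str) -> float:
    """Average every 'Train Samples / sec' value found in a log excerpt."""
    values = [float(v) for v in SAMPLES_RE.findall(log_text)]
    if not values:
        raise ValueError("no 'Train Samples / sec' lines found")
    return mean(values)


def scaling_summary(single_gpu_log: str, multi_gpu_log: str, num_gpus: int) -> dict:
    """Compare throughput of a multi-GPU run against a single-GPU run."""
    single = mean_samples_per_sec(single_gpu_log)
    multi = mean_samples_per_sec(multi_gpu_log)
    speedup = multi / single
    return {
        "single_gpu_samples_per_sec": single,
        "multi_gpu_samples_per_sec": multi,
        "speedup": speedup,
        # 1.0 would mean perfect linear scaling (assumption: ideal = num_gpus x).
        "efficiency": speedup / num_gpus,
    }


# Two lines copied verbatim from each log above.
P3_2XLARGE = """
2022-06-09 08:02:08,087 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.760
2022-06-09 08:02:10,154 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.379
"""
P3_16XLARGE = """
2022-06-09 08:06:39,942 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 310.953
2022-06-09 08:06:42,542 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 307.691
"""

summary = scaling_summary(P3_2XLARGE, P3_16XLARGE, num_gpus=8)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Pasting the full logs into the two strings instead of the short excerpts gives a steadier average. The per-epoch `Time taken` values in the logs (0:02:11 on the p3.2xlarge vs. 0:00:20 on the p3.16xlarge) can be compared in the same way.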
■ EC2 Instance Specifications