I want to speed up training with TAO on EC2.

I ran TAO training on both a p3.2xlarge and a p3.16xlarge.
However, when I compared them, there did not seem to be much difference.

The p3.16xlarge has eight GPUs instead of one, but is the training unable to use them to the fullest?
If so, how can I maximize GPU utilization?

■ TAO training log for p3.2xlarge
command

!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector \
                        --gpus 1
2022-06-09 08:02:01,930 [INFO] tensorflow: epoch = 84.7493654822335, learning_rate = 0.00045429557, loss = 6.0011807e-05, step = 133565 (5.151 sec)
2022-06-09 08:02:03,130 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.011
2022-06-09 08:02:06,036 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 34.413
INFO:tensorflow:epoch = 84.78172588832487, learning_rate = 0.00045241896, loss = 0.00011500824, step = 133616 (5.094 sec)
2022-06-09 08:02:07,024 [INFO] tensorflow: epoch = 84.78172588832487, learning_rate = 0.00045241896, loss = 0.00011500824, step = 133616 (5.094 sec)
INFO:tensorflow:global_step/sec: 11.311
2022-06-09 08:02:07,025 [INFO] tensorflow: global_step/sec: 11.311
2022-06-09 08:02:08,087 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.760
2022-06-09 08:02:10,154 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.379
INFO:tensorflow:epoch = 84.82106598984771, learning_rate = 0.0004501476, loss = 0.00012022845, step = 133678 (5.109 sec)
2022-06-09 08:02:12,133 [INFO] tensorflow: epoch = 84.82106598984771, learning_rate = 0.0004501476, loss = 0.00012022845, step = 133678 (5.109 sec)
2022-06-09 08:02:12,217 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.487
2022-06-09 08:02:14,307 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.849
2022-06-09 08:02:16,387 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.090
INFO:tensorflow:epoch = 84.86040609137055, learning_rate = 0.00044788793, loss = 0.00011997479, step = 133740 (5.148 sec)
2022-06-09 08:02:17,281 [INFO] tensorflow: epoch = 84.86040609137055, learning_rate = 0.00044788793, loss = 0.00011997479, step = 133740 (5.148 sec)
2022-06-09 08:02:18,467 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.091
INFO:tensorflow:global_step/sec: 12.0438
2022-06-09 08:02:20,061 [INFO] tensorflow: global_step/sec: 12.0438
2022-06-09 08:02:20,573 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.493
INFO:tensorflow:epoch = 84.8991116751269, learning_rate = 0.00044567612, loss = 9.090778e-05, step = 133801 (5.084 sec)
2022-06-09 08:02:22,364 [INFO] tensorflow: epoch = 84.8991116751269, learning_rate = 0.00044567612, loss = 9.090778e-05, step = 133801 (5.084 sec)
2022-06-09 08:02:22,609 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 49.131
2022-06-09 08:02:24,720 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.375
2022-06-09 08:02:26,777 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.609
INFO:tensorflow:epoch = 84.93845177664974, learning_rate = 0.00044343888, loss = 6.7681714e-05, step = 133863 (5.146 sec)
2022-06-09 08:02:27,510 [INFO] tensorflow: epoch = 84.93845177664974, learning_rate = 0.00044343888, loss = 6.7681714e-05, step = 133863 (5.146 sec)
2022-06-09 08:02:28,854 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.171
2022-06-09 08:02:30,966 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.339
INFO:tensorflow:epoch = 84.97779187817258, learning_rate = 0.00044121308, loss = 5.957734e-05, step = 133925 (5.140 sec)
2022-06-09 08:02:32,650 [INFO] tensorflow: epoch = 84.97779187817258, learning_rate = 0.00044121308, loss = 5.957734e-05, step = 133925 (5.140 sec)
2022-06-09 08:02:32,981 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 49.645
INFO:tensorflow:global_step/sec: 12.0718
2022-06-09 08:02:33,066 [INFO] tensorflow: global_step/sec: 12.0718
2022-06-09 08:02:35,074 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 47.787
2022-06-09 08:02:35,561 [INFO] root: None
2022-06-09 08:02:35,561 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 85/120: loss: 0.00005 learning rate: 0.00044 Time taken: 0:02:11.857269 ETA: 1:16:55.004400
2022-06-09 08:02:37,143 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.331

■ TAO training log for p3.16xlarge
command

!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector \
                        --gpus 8
INFO:tensorflow:global_step/sec: 9.69944
2022-06-09 08:06:39,635 [INFO] tensorflow: global_step/sec: 9.69944
2022-06-09 08:06:39,942 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 310.953
INFO:tensorflow:global_step/sec: 9.69271
2022-06-09 08:06:41,595 [INFO] tensorflow: global_step/sec: 9.69271
2022-06-09 08:06:42,542 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 307.691
INFO:tensorflow:global_step/sec: 9.44691
2022-06-09 08:06:43,607 [INFO] tensorflow: global_step/sec: 9.44691
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007937764, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007937764, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00085954077, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00085954077, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00067783205, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00067783205, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006431361, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006431361, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0008180385, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0008180385, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006592367, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0006592367, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007017506, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.0007017506, step = 2119 (5.123 sec)
INFO:tensorflow:epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00066348375, step = 2119 (5.123 sec)
2022-06-09 08:06:44,653 [INFO] tensorflow: epoch = 10.756345177664974, learning_rate = 0.00031023743, loss = 0.00066348375, step = 2119 (5.123 sec)
2022-06-09 08:06:45,178 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 303.531
INFO:tensorflow:global_step/sec: 9.57317
2022-06-09 08:06:45,591 [INFO] tensorflow: global_step/sec: 9.57317
INFO:tensorflow:global_step/sec: 9.58015
2022-06-09 08:06:47,574 [INFO] tensorflow: global_step/sec: 9.58015
2022-06-09 08:06:47,776 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 308.004
INFO:tensorflow:global_step/sec: 9.67228
2022-06-09 08:06:49,539 [INFO] tensorflow: global_step/sec: 9.67228
2022-06-09 08:06:49,652 [INFO] root: None
2022-06-09 08:06:49,652 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 11/120: loss: 0.00082 learning rate: 0.00034 Time taken: 0:00:20.480506 ETA: 0:37:12.375148
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00067561684, step = 2169 (5.205 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00067561684, step = 2169 (5.205 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007094919, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007094919, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0006856213, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0006856213, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00069529866, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00069529866, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.000748715, step = 2169 (5.206 sec)
2022-06-09 08:06:49,859 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.000748715, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00065234513, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00065234513, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007180649, step = 2169 (5.206 sec)
INFO:tensorflow:epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00072094, step = 2169 (5.206 sec)
2022-06-09 08:06:49,859 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.0007180649, step = 2169 (5.206 sec)
2022-06-09 08:06:49,858 [INFO] tensorflow: epoch = 11.010152284263958, learning_rate = 0.0003419758, loss = 0.00072094, step = 2169 (5.206 sec)
2022-06-09 08:06:50,390 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 306.048
INFO:tensorflow:global_step/sec: 9.4631
2022-06-09 08:06:51,547 [INFO] tensorflow: global_step/sec: 9.4631
2022-06-09 08:06:53,004 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 306.110
INFO:tensorflow:global_step/sec: 9.55692
2022-06-09 08:06:53,535 [INFO] tensorflow: global_step/sec: 9.55692
INFO:tensorflow:epoch = 11.258883248730964, learning_rate = 0.00037622757, loss = 0.00079730956, step = 2218 (5.155 sec)
INFO:tensorflow:epoch = 11.258883248730964, learning_rate = 0.00037622757, loss = 0.0007509783, step = 2218 (5.155 sec)
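
A quick sanity check is to compare the "Train Samples / sec" lines from the two logs rather than global_step/sec: each step on the 8-GPU run processes 8× the samples, so similar step rates still mean much higher throughput. A minimal sketch (the helper function is illustrative; the log excerpts are copied from above):

```python
import re

def mean_samples_per_sec(log_text: str) -> float:
    """Average the 'Train Samples / sec' values in a TAO training log."""
    rates = [float(m) for m in re.findall(r"Train Samples / sec: ([\d.]+)", log_text)]
    return sum(rates) / len(rates)

# Excerpts from the two logs above
log_1gpu = """
2022-06-09 08:02:08,087 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.760
2022-06-09 08:02:10,154 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 48.379
"""
log_8gpu = """
2022-06-09 08:06:39,942 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 310.953
2022-06-09 08:06:42,542 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 307.691
"""

r1 = mean_samples_per_sec(log_1gpu)         # ~48.6 samples/sec on 1 GPU
r8 = mean_samples_per_sec(log_8gpu)         # ~309 samples/sec on 8 GPUs
print(f"throughput ratio: {r8 / r1:.1f}x")  # ~6.4x
```

So the 8-GPU instance is pushing roughly 6.4× the samples per second, even though global_step/sec looks similar.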

■ Per-epoch training time on each EC2 instance

■ p3.2xlarge

2022-06-09 07:53:54,485 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 81/120: loss: 0.00009 learning rate: 0.00050 Time taken: 0:02:17.899911 ETA: 1:29:38.096517
→ about 2 minutes and 20 seconds per epoch

■ p3.16xlarge

task_progress_monitor_hook: Epoch 1/120: loss: 0.00104 learning rate: 0.00001 Time taken: 0:00:37.052950 ETA: 1:13:29.301067
→ about 37 seconds per epoch
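
Converting the "Time taken" values from the two log lines above into a per-epoch speedup:

```python
# Per-epoch times reported by task_progress_monitor_hook in the logs above
t_p3_2xlarge = 2 * 60 + 17.9   # 0:02:17.9 per epoch on 1x GPU (seconds)
t_p3_16xlarge = 37.05          # 0:00:37.05 per epoch on 8x GPUs (seconds)

speedup = t_p3_2xlarge / t_p3_16xlarge
print(f"per-epoch speedup: {speedup:.1f}x")  # ~3.7x
```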

Per epoch, the p3.16xlarge is about 2 minutes faster than the p3.2xlarge (roughly a 3.7× speedup).
Is this comparison correct?

Yes. You can also compare the total elapsed time once both training runs have finished.
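
Extrapolating the per-epoch times above over the full 120-epoch run also gives a rough cost comparison. The hourly prices below are assumptions (typical us-east-1 on-demand rates; check current AWS pricing for your region):

```python
EPOCHS = 120

# Per-epoch times from the logs above (seconds)
hours_2xl = EPOCHS * 137.9 / 3600    # ~4.6 h total on p3.2xlarge
hours_16xl = EPOCHS * 37.05 / 3600   # ~1.2 h total on p3.16xlarge

# Assumed on-demand prices in USD/hour -- verify against current AWS pricing
PRICE_2XL, PRICE_16XL = 3.06, 24.48

cost_2xl = hours_2xl * PRICE_2XL
cost_16xl = hours_16xl * PRICE_16XL
print(f"p3.2xlarge:  {hours_2xl:.1f} h, ~${cost_2xl:.0f}")
print(f"p3.16xlarge: {hours_16xl:.1f} h, ~${cost_16xl:.0f}")
```

Under these assumed prices, the p3.16xlarge finishes about 3.7× sooner but costs roughly twice as much for the full run, so the right choice depends on whether wall-clock time or cost matters more.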
