Status of fp16 resnet-50 training?

yaroslavvb · December 7, 2017, 9:33pm

What is the status of FP16 training of ResNet-50 on Volta GPUs using NGC containers? I’ve been told 5k images/second is achievable on a DGX-1 with Volta GPUs. Is this performance:

1. achievable using TensorFlow from the NVIDIA NGC container?
2. available for both training and inference?
3. usable for actual training (i.e., training in FP16 end-to-end without significant accuracy loss)?

CTierney · December 9, 2017, 12:15am

The NGC containers are based on the same development work as the DGX-1 containers. As to your questions:

1. If you run the NGC container on a DGX-1, performance should be similar. Performance will vary based on your actual system architecture.
2. Yes, but see #1 for caveats.
3. Yes, you should be able to use mixed-precision training and not only get improved training performance but also see training converge in the same number of epochs. Again, convergence may vary depending on what exactly you are doing. However, for many models, you should see similar training accuracy. See the following link for details:

https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/

This blog entry references a couple of documents about accuracy with mixed-precision training that should explain in more detail what you can expect when using FP16 and Tensor Cores.
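As a rough illustration of why mixed precision (rather than pure FP16) tends to preserve accuracy, here is a minimal sketch of loss scaling with a higher-precision master weight, two techniques commonly used in mixed-precision training. It is not the NGC container's implementation. The gradient value, `LOSS_SCALE`, learning rate, and the `to_fp16` helper are all made up for illustration, and a Python float (64-bit) stands in for the FP32 master copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical values, chosen only to make the effect visible.
LOSS_SCALE = 1024.0   # constant loss scale (assumption)
true_grad = 1e-8      # a gradient smaller than FP16 can represent

# Stored directly in FP16, the gradient underflows to zero.
naive_grad = to_fp16(true_grad)

# Scaling the loss scales every gradient, moving it into FP16 range.
scaled_grad = to_fp16(true_grad * LOSS_SCALE)

# Unscale in higher precision before applying the weight update.
recovered_grad = scaled_grad / LOSS_SCALE

# The master weight is kept in higher precision so tiny updates survive.
master_weight = 0.5
learning_rate = 0.1
master_weight -= learning_rate * recovered_grad

print("naive FP16 gradient:    ", naive_grad)
print("recovered gradient:     ", recovered_grad)
print("updated master weight:  ", repr(master_weight))
```

Without scaling, the update would be lost entirely; with it, the gradient is recovered to within FP16 rounding error. Real frameworks typically also check for overflow and adjust the scale dynamically, which this sketch omits.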
