Tips for Optimizing GPU Performance Using Tensor Cores

Originally published at: Tips for Optimizing GPU Performance Using Tensor Cores | NVIDIA Technical Blog

Our most popular question is “What can I do to get great GPU performance for deep learning?” We’ve recently published a detailed Deep Learning Performance Guide to help answer this question. The guide explains how GPUs process data and gives tips on how to design networks for better performance. We also take a close look at Tensor Core…

Thanks for the post!

I have a question about enabling Tensor Cores. Where should we set the values for "the batch size and number of inputs and outputs, for a fully-connected layer and channels in and out, for a convolutional layer"?

Glad you enjoyed the post!

That depends on how you are running your network. In our APIs and most frameworks, you can specify these parameters when you define a layer and its inputs and outputs. Are you using cuBLAS or cuDNN, or a particular framework?
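
In the meantime, here's a rough illustration. This assumes a PyTorch-style definition purely as an example (you haven't said which framework you use; the same parameters exist in other frameworks and in the cuBLAS/cuDNN APIs):

```python
import torch
import torch.nn as nn

# Illustrative only: inputs, outputs, and channel counts are chosen as
# multiples of 8 so FP16 Tensor Cores can be used.
fc   = nn.Linear(in_features=1024, out_features=4096)               # fully-connected layer
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)   # convolutional layer

# The batch size is simply the first dimension of the input tensor.
x = torch.randn(256, 1024)   # batch size 256, also a multiple of 8
y = fc(x)
```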

Hi Valerie, the blog post I linked says: "Earlier versions of cuDNN required the channel dimension of all tensors be a multiple of 8. That constraint no longer applies to packed NCHW data; cuDNN now automatically pads the tensors as needed."
But this post says: "We recommend ensuring all such parameters are multiples of 8 when training with FP16 and multiples of 16 when training with INT8. These include batch size and number of inputs and outputs, for a fully-connected layer and channels in and out, for a convolutional layer."
For a convolutional layer, is it necessary to ensure the channel dimensions are multiples of 8?

This is a very good question! The blog you linked to is correct: with data in the NCHW layout, cuDNN performs automatic padding of channel in and out counts of convolutional layers, so in that case Tensor Cores will activate even when channels in and out are not set to multiples of 8.

For brevity, this post focused on the strictest version of these rules: when using data in the NHWC layout, automatic padding won't occur. We talk about the difference between these formats in the Tensor Layouts section of the Deep Learning Performance Guide, if you'd like to read more. The Channels In and Out section of the guide also explains in more detail how this affects the rules for channel counts. (Channels In and Out describes special case kernels for layers with four input channels as well, which may be of interest!)
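
As a rough illustration of what that padding amounts to (this is just a sketch of the rounding, not cuDNN's actual implementation), channel counts are effectively rounded up to the next multiple of 8 for packed NCHW data:

```python
import math

def pad_channels(c, multiple=8):
    """Round a channel count up to the next multiple (8 for FP16, 16 for INT8)."""
    return math.ceil(c / multiple) * multiple

# A conv layer with 62 input / 130 output channels in packed NCHW can still
# use Tensor Cores, because the counts are padded internally:
print(pad_channels(62), pad_channels(130))   # -> 64 136
# With NHWC data there is no such padding, so choose multiples of 8 yourself.
```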

Thanks for the reply. I am using Caffe. I see that we can define a layer and its outputs; I guess the inputs in this case will be the outputs of the previous layer. But I'm not sure how I can define the batch size.

Hi,

I wonder whether Tensor Cores have a wave quantization problem, since different GPUs have different numbers of Tensor Cores.

I'm a little rusty with Caffe, but if memory serves, the batch size is controlled by the shape of the tensor you use as input during training or inference, which is probably defined in the data layer. So the first dimension of your input blob in net.blobs (or whatever form your data tensor takes) sets the batch size for all layers.
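
For instance, with pycaffe it might look something like this (the 'data' blob name and the file paths are placeholders for whatever your model uses):

```python
import caffe

# Placeholder paths; substitute your own model definition and weights.
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

# The first dimension of the input blob is the batch size.
batch_size = 32                               # a multiple of 8 for Tensor Cores
c, h, w = net.blobs['data'].data.shape[1:]    # keep the other dimensions as-is
net.blobs['data'].reshape(batch_size, c, h, w)
net.reshape()                                 # propagate the new shape through the net
```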

Which framework works best for Tensor cores?

The overall Tensor Core count of a GPU doesn't have a separate wave quantization effect. SM-based wave quantization, the sort that we talk about in this post, occurs because layer parameters can be set to any value. So we can choose an inefficient batch size such that the training work can't be divided evenly among SMs once split into tiles / thread blocks. You don't need to worry about this issue at the Tensor Cores level because the tile sizes available are designed to allow efficient work by the set of Tensor Cores in an SM!

That question is very complex! There isn't any single preferred framework. Our Training With Mixed Precision guide, and in particular this section explaining how to set up and optimize for Tensor Cores in various frameworks, might be a good place to start.

I see. Thank you!

Do you mean that the tile sizes of Tensor Cores are more flexible than the tile size options in cuBLAS?

My wording wasn't precise, sorry! By tile sizes, I mean those available in cuBLAS. This sort of tiling doesn't occur at the Tensor Cores level.

To illustrate, consider our feed-forward layer example from the post again. With a batch size of 2048, the equivalent output matrix would have dimensions of 4096 x 2048. Assuming the 256 x 128 tile size is used, 16 x 16 = 256 total tiles are created. These tiles can't be split evenly between the 80 SMs on a Tesla V100 GPU, so this case suffers from wave quantization.
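
In code form, the arithmetic for that example looks like this (a minimal sketch; 256 x 128 is just one of the tile sizes cuBLAS can pick):

```python
import math

sm_count = 80                 # Tesla V100
tile_m, tile_n = 256, 128     # assumed cuBLAS tile size
m, n = 4096, 2048             # output matrix for batch size 2048

tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)   # 16 * 16 = 256 tiles
waves = tiles / sm_count                                 # 3.2 waves
print(tiles, waves)
# The fractional last wave runs only 16 thread blocks across 80 SMs,
# leaving most SMs idle: that's the wave quantization effect.
```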

With a tile size of 256 x 128, each SM handles the thread block for one tile at a time. The amount of work done by this thread block is controlled by the tile size, and we design the available tile sizes such that the corresponding thread blocks can be calculated jointly by the Tensor Cores on an SM with maximum efficiency. So you don't need to worry about wave quantization at this level.

Put another way: a Tesla V100 GPU has 80 SMs, and each SM has 8 Tensor Cores, for a total of 640 Tensor Cores on the GPU. However, wave quantization depends directly on the number of SMs and the tile size; the number of Tensor Cores isn't itself relevant. (Your intuition that the number of Tensor Cores affects quantization isn't entirely off: since the Tensor Core count is 8 times the SM count on this GPU, that information is already taken into account!)

Hope this makes it clearer!

I see it now. Thank you so much for the reply! This answer is great!

Thanks for your reply! I have another question.
In the cuDNN Developer Guide:
"For algorithms other than *_ALGO_WINOGRAD_NONFUSED, when the following requirements are met, the cuDNN library will trigger the Tensor Core operations: The number of input and output feature maps is a multiple of 8."
Question: for the *_ALGO_WINOGRAD_NONFUSED algorithms, what are the requirements? I ask because the TF_ENABLE_WINOGRAD_NONFUSED variable is enabled by default.