Tips for Optimizing GPU Performance Using Tensor Cores

Originally published at: Tips for Optimizing GPU Performance Using Tensor Cores | NVIDIA Technical Blog

Our most popular question is “What can I do to get great GPU performance for deep learning?” We’ve recently published a detailed Deep Learning Performance Guide to help answer this question. The guide explains how GPUs process data and gives tips on how to design networks for better performance. We also take a close look at Tensor Core…

Thanks for the post!

I have a question about enabling Tensor Cores. Where should we set the values for "the batch size and number of inputs and outputs, for a fully-connected layer and channels in and out, for a convolutional layer"?

Glad you enjoyed the post!

That depends on how you are running your network. In our APIs and most frameworks, you can specify these parameters when you define a layer and its inputs and outputs. Are you using cuBLAS or cuDNN, or a particular framework?
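
In the meantime, here's a rough illustration. This assumes a PyTorch-style definition purely as an example (you haven't said which framework you use; the same parameters exist in other frameworks and in the cuBLAS/cuDNN APIs):

```python
import torch
import torch.nn as nn

# Illustrative only: inputs, outputs, and channel counts are chosen as
# multiples of 8 so FP16 Tensor Cores can be used.
fc   = nn.Linear(in_features=1024, out_features=4096)               # fully-connected layer
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)   # convolutional layer

# The batch size is simply the first dimension of the input tensor.
x = torch.randn(256, 1024)   # batch size 256, also a multiple of 8
y = fc(x)
```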

Hi Valerie, the blog post I linked says: "Earlier versions of cuDNN required the channel dimension of all tensors be a multiple of 8. That constraint no longer applies to packed NCHW data; cuDNN now automatically pads the tensors as needed."
But this post says: "We recommend ensuring all such parameters are multiples of 8 when training with FP16 and multiples of 16 when training with INT8. These include batch size and number of inputs and outputs, for a fully-connected layer and channels in and out, for a convolutional layer."
For a convolutional layer, is it necessary to ensure the channel dimensions are multiples of 8?

This is a very good question! The blog you linked to is correct: with data in the NCHW layout, cuDNN performs automatic padding of channel in and out counts of convolutional layers, so in that case Tensor Cores will activate even when channels in and out are not set to multiples of 8.

For brevity, this post focused on the strictest version of these rules: when using data in the NHWC layout, automatic padding won't occur. We talk about the difference between these formats in the Tensor Layouts section of the Deep Learning Performance Guide, if you'd like to read more. The Channels In and Out section of the guide also explains in more detail how this affects the rules for channel counts. (Channels In and Out describes special case kernels for layers with four input channels as well, which may be of interest!)
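
As a rough illustration of what that padding amounts to (this is just a sketch of the rounding, not cuDNN's actual implementation), channel counts are effectively rounded up to the next multiple of 8 for packed NCHW data:

```python
import math

def pad_channels(c, multiple=8):
    """Round a channel count up to the next multiple (8 for FP16, 16 for INT8)."""
    return math.ceil(c / multiple) * multiple

# A conv layer with 62 input / 130 output channels in packed NCHW can still
# use Tensor Cores, because the counts are padded internally:
print(pad_channels(62), pad_channels(130))   # -> 64 136
# With NHWC data there is no such padding, so choose multiples of 8 yourself.
```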

Thanks for the reply. I am using Caffe. I see that we can define a layer and its outputs; I guess the inputs in this case will be the outputs of the previous layer. But I'm not sure how I can define the batch size.

Hi,

I wonder whether Tensor Cores have a wave quantization problem, since different GPUs have different numbers of Tensor Cores.

I'm a little rusty with Caffe, but if memory serves, the batch size is controlled by the shape of the tensor you use as input during training or inference, which is probably defined in the data layer. So the first dimension of your input blob in net.blobs (or whatever form your data tensor takes) sets the batch size for all layers.
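
For instance, with pycaffe it might look something like this (the 'data' blob name and the file paths are placeholders for whatever your model uses):

```python
import caffe

# Placeholder paths; substitute your own model definition and weights.
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

# The first dimension of the input blob is the batch size.
batch_size = 32                               # a multiple of 8 for Tensor Cores
c, h, w = net.blobs['data'].data.shape[1:]    # keep the other dimensions as-is
net.blobs['data'].reshape(batch_size, c, h, w)
net.reshape()                                 # propagate the new shape through the net
```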

Which framework works best for Tensor cores?

The overall Tensor Core count of a GPU doesn't have a separate wave quantization effect. SM-based wave quantization, the sort that we talk about in this post, occurs because layer parameters can be set to any value. So we can choose an inefficient batch size such that the training work can't be divided evenly among SMs once split into tiles / thread blocks. You don't need to worry about this issue at the Tensor Cores level because the tile sizes available are designed to allow efficient work by the set of Tensor Cores in an SM!

That question is very complex! There isn't any single preferred framework. Our Training With Mixed Precision guide, and in particular this section explaining how to set up and optimize for Tensor Cores in various frameworks, might be a good place to start.

I see. Thank you!

Do you mean that the tile sizes of Tensor Cores are more flexible than the tile size options in cuBLAS?

My wording wasn't precise, sorry! By tile sizes, I mean those available in cuBLAS. This sort of tiling doesn't occur at the Tensor Cores level.

To illustrate, consider our feed-forward layer example from the post again. With a batch size of 2048, the equivalent output matrix would have dimensions of 4096 x 2048. Assuming the 256 x 128 tile size is used, 16 x 16 = 256 total tiles are created. These tiles can't be split evenly between the 80 SMs on a Tesla V100 GPU, so this case suffers from wave quantization.
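
In code form, the arithmetic for that example looks like this (a minimal sketch; 256 x 128 is just one of the tile sizes cuBLAS can pick):

```python
import math

sm_count = 80                 # Tesla V100
tile_m, tile_n = 256, 128     # assumed cuBLAS tile size
m, n = 4096, 2048             # output matrix for batch size 2048

tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)   # 16 * 16 = 256 tiles
waves = tiles / sm_count                                 # 3.2 waves
print(tiles, waves)
# The fractional last wave runs only 16 thread blocks across 80 SMs,
# leaving most SMs idle: that's the wave quantization effect.
```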

With a tile size of 256 x 128, each SM handles the thread block for one tile at a time. The amount of work done by this thread block is controlled by the tile size, and we design the available tile sizes such that the corresponding thread blocks can be calculated jointly by the Tensor Cores on an SM with maximum efficiency. So you don't need to worry about wave quantization at this level.

Put another way: a Tesla V100 GPU has 80 SMs, and each SM has 8 Tensor Cores, for a total of 640 Tensor Cores on the GPU. However, wave quantization depends directly on the number of SMs and the tile size; the number of Tensor Cores isn't itself relevant. (Your intuition that the number of Tensor Cores affects quantization isn't entirely off: since the Tensor Core count is 8 times the SM count on this GPU, that information is already taken into account!)

Hope this makes it clearer!

I see it now. Thank you so much for the reply! This answer is great!

Thanks for your reply! I have another question.
In the cuDNN Developer Guide:
"For algorithms other than *_ALGO_WINOGRAD_NONFUSED, when the following requirements are met, the cuDNN library will trigger the Tensor Core operations: The number of input and output feature maps is a multiple of 8."
Question: for the *_ALGO_WINOGRAD_NONFUSED algorithms, what are the requirements? I ask because the TF_ENABLE_WINOGRAD_NONFUSED variable is enabled by default.