Unexpected low fp16 performance on P3

I’m getting highly non-uniform performance for float16 matmul on P3 using the recommended NVIDIA container. I was told by Tom Reed at GTC that this is not expected, so maybe someone could redirect this to the proper channel:

To reproduce, run the following on a Volta machine.

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
python matmul_benchmark_seq.py --dtype=float16

You’ll see something like this.

7512,76.0847702634
8192,87.2323633474
8933,15.2443599021
9741,15.0255254543

This means it achieved 87 Tflop/s for an 8192x8192 matmul, followed by 15 Tflop/s for 8933x8933.
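For reference, the second column follows from the standard 2·n³ flop count for a dense matmul. A minimal sketch of that arithmetic (my own illustration, not the benchmark script itself):

```python
def matmul_tflops(n, seconds):
    """Effective throughput of an n x n matmul in Tflop/s: n**2 outputs,
    each costing ~2n flops (n multiplies, n-1 adds)."""
    return 2 * n**3 / seconds / 1e12

# The 8192 row: ~87 Tflop/s corresponds to ~12.6 ms per matmul.
print(matmul_tflops(8192, 0.0126))
```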

For more graphs, see https://medium.com/@yaroslavvb/peak-performance-of-amazon-p3-instances-f2bc48f9ef71

For more details: I used the Amazon Ubuntu CUDA 9 AMI – https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509675887754&sr=0-4&ref_=srh_res_product_title

Then I followed the AWS instructions to optimize for GPUs:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

Then I used nvidia-docker with the official TensorFlow container.

sudo nvidia-persistenced
sudo nvidia-smi -ac 877,1530 # p3

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
sudo apt-get update
apt-cache search docker-ce
sudo apt-get install -y docker-ce
wget  https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.1-1_amd64.deb

sudo docker login nvcr.io
sudo docker pull nvcr.io/nvidia/tensorflow:17.10

sudo nvidia-docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v /home/ubuntu/docker:/data/mnist nvcr.io/nvidia/tensorflow:17.10

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
export CUDA_VISIBLE_DEVICES=0
python matmul_benchmark_seq.py --dtype=float16


Our engineering team states that k, lda, ldb, and ldc must each be a multiple of eight, and m must be a multiple of four. The Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the matrices must be multiples of eight.
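Based on that constraint, one practical workaround (my own sketch, not something from NVIDIA's docs) is to round dimensions up to the next multiple of eight and zero-pad before the matmul:

```python
def pad_to_multiple(dim, multiple=8):
    """Round a matrix dimension up to the next multiple of `multiple`
    (eight, per the Tensor Core constraint described above)."""
    return -(-dim // multiple) * multiple

# The slow sizes from the benchmark are not multiples of eight:
print(8933 % 8, 9741 % 8)        # 5 5
print(pad_to_multiple(8933))     # 8936
print(pad_to_multiple(8192))     # 8192 (already aligned: the fast case)
```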

For more details see the post at https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/

Testing on my Titan V, I see spikes too.

I assume the spikes only become visible starting at 512x512 because for smaller matmuls too much time is spent copying data:

430,1.0276874923
469,1.2882271302
512,2.2436777223
558,1.8147125514
608,3.6640702490
663,2.8910543966
724,3.6435559331
789,3.5648453471
861,4.4266646591
939,5.2201968200
1024,14.8427175163
1116,6.2453918178
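That assumption can be sanity-checked with a rough roofline-style estimate. This is a sketch under assumed V100 numbers (~900 GB/s HBM2 bandwidth, ~112 Tflop/s fp16 Tensor Core peak), not a measurement:

```python
def matmul_times_us(n, mem_bw_gbs=900.0, peak_tflops=112.0):
    """Rough time budget for an n x n fp16 matmul: time to move
    A, B, and C through memory vs. time to do the 2*n**3 flops.
    Returns (memory_us, compute_us)."""
    mem_us = (3 * n * n * 2) / (mem_bw_gbs * 1e9) * 1e6
    compute_us = (2 * n**3) / (peak_tflops * 1e12) * 1e6
    return mem_us, compute_us

# Below roughly n=500 the memory term dominates, so Tensor Core
# speedups would be hidden behind data movement:
for n in (256, 512, 1024):
    print(n, matmul_times_us(n))
```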

I am still trying to get a significant performance increase in more realistic DL tasks on V100 (Titan V). So far, even with very matmul/conv-heavy architectures (Transformer), I only see a 25% performance increase when switching to FP16 - nothing like the spikes in this synthetic test.

Also, I don’t see a “doubling” of available memory: I can only increase batch size by ~10% when switching all my variables from FP32 to FP16 before hitting out-of-memory. I guess my TensorFlow implementation has FP32s somewhere, and lots of them.
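For what it’s worth, here is one plausible accounting of why halving variable precision doesn’t halve memory: if an Adam-style optimizer keeps two FP32 moment slots per parameter (and activations are ignored entirely), switching variables to FP16 only shaves off about a sixth of the parameter-related memory. The 50M-parameter model size below is hypothetical:

```python
def param_mib(n_params, var_bytes, n_slots=2, slot_bytes=4):
    """Memory for parameters plus optimizer slot variables (assumed
    FP32 Adam-style moment estimates), ignoring activations."""
    return n_params * (var_bytes + n_slots * slot_bytes) / 2**20

n = 50_000_000                      # hypothetical model size
fp32 = param_mib(n, 4)              # ~572 MiB
fp16 = param_mib(n, 2)              # ~477 MiB
print(fp32, fp16, 1 - fp16 / fp32)  # only ~17% saved
```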

From the above observations, let’s assume that for an 8192x8192 matmul the V100 becomes compute-bound:

  • it needs to transfer ~256MiB of input data, plus moving data within the GPU
  • it performs ~1.1 Tflop (2·8192³ multiply-add operations)
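A quick check of those two figures:

```python
n = 8192
input_mib = 2 * n * n * 2 / 2**20   # two fp16 input matrices
tflop = 2 * n**3 / 1e12             # multiply-add count for the matmul
print(input_mib, tflop)             # 256.0 MiB, ~1.1 Tflop
```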

For a 1x1 2d convolution with N input channels and N output channels:

  • we need a ~floor(sqrt(134217728/N))-sized square image to apply the conv to, to get the same amount of input data
  • it will perform (N² + (N−1)·N)·(134217728/N) ≈ 268435456·N flops ≈ 0.27·N Gflop (we need N ≈ 4096 to match the 8192x8192 matmul’s 2·8192³ ≈ 1.1 Tflop)
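Verifying that equal-channel arithmetic (assuming, as above, that the conv’s output-channel count equals its input-channel count):

```python
matmul_bytes = 8192 * 8192 * 2        # one fp16 input matrix: 134217728 bytes
matmul_flops = 2 * 8192**3

def conv_flops(n):
    hw = matmul_bytes // n            # pixels needed for equal input data
    return 2 * n**2 * hw              # per pixel: n dot products of length n

# N = 4096 gives the same flop count as the 8192 x 8192 matmul:
print(conv_flops(4096) == matmul_flops)  # True
```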

In practice I get an error when trying to create a matrix that big; I had to scale the size down by 100x. The error looks like:

tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=134212225, n=1, k=1
         [[Node: Conv2D = Conv2D[T=DT_HALF, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Variable/read, Variable_1/read)]]

Testing (extending Yaroslav’s code: https://gist.github.com/dimitry12/d8eb165eb9ecd474d6a017156bec3466#file-conv-py-L76-L79):

38,0.6277977837
41,0.7149783805
45,0.8152390789
49,0.9075997542
53,1.0192397357
58,1.0858789485
64,1.5005017117
69,1.3305306010
76,1.4670993722
82,1.4340319776
90,1.5266188081
98,1.6651400846
107,1.9491080286
117,2.1242829384
128,2.9893740279
139,1.9558330697
152,2.9484651222
165,2.3725707415
181,2.5822941898
197,2.9695160932
215,3.1080774282
234,3.2204323179
256,5.3002793901
279,3.8160422806
304,4.7311686984
331,3.7756175425
362,4.2303946644
394,3.9343579247
430,3.8086561374
469,5.0551012201
512,6.7722605583
558,4.7593597729
608,7.7169478252
663,4.5043897079
724,4.3966474121
789,4.4033521839
861,4.3245146979
939,3.8774351501
1024,4.4114756494

Spikes are present, but they are less pronounced and overall flops are much lower (my math is certainly wrong somewhere). At least it confirms for me that conv2d in nvcr.io/nvidia/tensorflow:17.12 does use Tensor Cores.

Interestingly, performance degrades as the number of channels increases (and HxW correspondingly decreases).

Really, with a 1x1 kernel, conv2d is not even a matrix-matrix multiply per pixel, but a vector-matrix multiply - I am surprised the spikes (a tell-tale of Tensor Cores) even show up.
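That said, batched over all HxW pixels, the 1x1 conv is equivalent to one full matrix-matrix multiply, which may be why Tensor Cores can engage at all. A numpy sketch of the equivalence (hypothetical small sizes):

```python
import numpy as np

h, w, n = 4, 4, 8
x = np.random.rand(h, w, n).astype(np.float32)   # NHWC image, n channels
k = np.random.rand(n, n).astype(np.float32)      # 1x1 kernel: n in, n out

conv_out = np.einsum('hwi,io->hwo', x, k)        # per-pixel vector-matrix
matmul_out = (x.reshape(h * w, n) @ k).reshape(h, w, n)

print(np.allclose(conv_out, matmul_out))         # True
```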