Unexpected low fp16 performance on P3

I’m getting highly non-uniform performance for float16 matmul on P3 using the recommended NVIDIA container. I was told by Tom Reed at GTC that this is not expected, so maybe someone could redirect this to the proper channel:

To reproduce, run the following on a Volta machine.

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
python matmul_benchmark_seq.py --dtype=float16

You’ll see something like this.

7512,76.0847702634
8192,87.2323633474
8933,15.2443599021
9741,15.0255254543

This means it achieved 87 Tflop/s for an 8192x8192 matmul, followed by 15 Tflop/s for 8933x8933.
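For reference, the second column follows from the standard 2·n³ flop count for a dense matmul. A minimal sketch of that arithmetic (my own illustration, not the benchmark script itself):

```python
def matmul_tflops(n, seconds):
    """Effective throughput of an n x n matmul in Tflop/s: n**2 outputs,
    each costing ~2n flops (n multiplies, n-1 adds)."""
    return 2 * n**3 / seconds / 1e12

# The 8192 row: ~87 Tflop/s corresponds to ~12.6 ms per matmul.
print(matmul_tflops(8192, 0.0126))
```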

For more graphs, see https://medium.com/@yaroslavvb/peak-performance-of-amazon-p3-instances-f2bc48f9ef71

For more details: I used the Amazon Ubuntu CUDA 9 AMI – https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509675887754&sr=0-4&ref_=srh_res_product_title

Then I followed the AWS instructions to optimize for GPUs:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

Then I used nvidia-docker with the official TensorFlow container.

sudo nvidia-persistenced
sudo nvidia-smi -ac 877,1530 # p3

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
sudo apt-get update
apt-cache search docker-ce
sudo apt-get install -y docker-ce
wget  https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.1-1_amd64.deb

sudo docker login nvcr.io
sudo docker pull nvcr.io/nvidia/tensorflow:17.10

sudo nvidia-docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v /home/ubuntu/docker:/data/mnist nvcr.io/nvidia/tensorflow:17.10

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
export CUDA_VISIBLE_DEVICES=0
python matmul_benchmark_seq.py --dtype=float16


Our engineering team states that k, lda, ldb, and ldc must each be a multiple of eight, and m must be a multiple of four. The Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the matrices must be multiples of eight.
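Based on that constraint, one practical workaround (my own sketch, not something from NVIDIA's docs) is to round dimensions up to the next multiple of eight and zero-pad before the matmul:

```python
def pad_to_multiple(dim, multiple=8):
    """Round a matrix dimension up to the next multiple of `multiple`
    (eight, per the Tensor Core constraint described above)."""
    return -(-dim // multiple) * multiple

# The slow sizes from the benchmark are not multiples of eight:
print(8933 % 8, 9741 % 8)        # 5 5
print(pad_to_multiple(8933))     # 8936
print(pad_to_multiple(8192))     # 8192 (already aligned: the fast case)
```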

For more details see the post at https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/

Testing on my Titan V, I see spikes too.

I assume the spikes only become visible starting at 512x512 because for smaller matmuls too much time is spent copying data:

430,1.0276874923
469,1.2882271302
512,2.2436777223
558,1.8147125514
608,3.6640702490
663,2.8910543966
724,3.6435559331
789,3.5648453471
861,4.4266646591
939,5.2201968200
1024,14.8427175163
1116,6.2453918178
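That assumption can be sanity-checked with a rough roofline-style estimate. This is a sketch under assumed V100 numbers (~900 GB/s HBM2 bandwidth, ~112 Tflop/s fp16 Tensor Core peak), not a measurement:

```python
def matmul_times_us(n, mem_bw_gbs=900.0, peak_tflops=112.0):
    """Rough time budget for an n x n fp16 matmul: time to move
    A, B, and C through memory vs. time to do the 2*n**3 flops.
    Returns (memory_us, compute_us)."""
    mem_us = (3 * n * n * 2) / (mem_bw_gbs * 1e9) * 1e6
    compute_us = (2 * n**3) / (peak_tflops * 1e12) * 1e6
    return mem_us, compute_us

# Below roughly n=500 the memory term dominates, so Tensor Core
# speedups would be hidden behind data movement:
for n in (256, 512, 1024):
    print(n, matmul_times_us(n))
```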

I am still trying to get a significant performance increase in more realistic DL tasks on V100 (Titan V). So far, even with very matmul/conv-heavy architectures (Transformer), I only see a 25% performance increase when switching to FP16 - nothing like the spikes in this synthetic test.

Also, I don’t see a “doubling” of available memory: I can only increase batch size by ~10% when switching all my variables from FP32 to FP16 before hitting out-of-memory. I guess my TensorFlow implementation has FP32s somewhere, and lots of them.
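For what it’s worth, here is one plausible accounting of why halving variable precision doesn’t halve memory: if an Adam-style optimizer keeps two FP32 moment slots per parameter (and activations are ignored entirely), switching variables to FP16 only shaves off about a sixth of the parameter-related memory. The 50M-parameter model size below is hypothetical:

```python
def param_mib(n_params, var_bytes, n_slots=2, slot_bytes=4):
    """Memory for parameters plus optimizer slot variables (assumed
    FP32 Adam-style moment estimates), ignoring activations."""
    return n_params * (var_bytes + n_slots * slot_bytes) / 2**20

n = 50_000_000                      # hypothetical model size
fp32 = param_mib(n, 4)              # ~572 MiB
fp16 = param_mib(n, 2)              # ~477 MiB
print(fp32, fp16, 1 - fp16 / fp32)  # only ~17% saved
```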

From the above observations, let’s assume that for an 8192x8192 matmul the V100 becomes compute-bound:

  • it needs to transfer ~256MiB of input data, plus moving data within the GPU
  • it performs ~1.1 Tflop (2·8192³ multiply-add operations)
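A quick check of those two figures:

```python
n = 8192
input_mib = 2 * n * n * 2 / 2**20   # two fp16 input matrices
tflop = 2 * n**3 / 1e12             # multiply-add count for the matmul
print(input_mib, tflop)             # 256.0 MiB, ~1.1 Tflop
```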

For a 1x1 2d convolution with N input channels and N output channels:

  • we need a ~floor(sqrt(134217728/N))-sized square image to apply the conv to, to get the same amount of input data
  • it will perform (N² + (N−1)·N)·(134217728/N) ≈ 268435456·N flops ≈ 0.27·N Gflop (we need N ≈ 4096 to match the 8192x8192 matmul’s 2·8192³ ≈ 1.1 Tflop)
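Verifying that equal-channel arithmetic (assuming, as above, that the conv’s output-channel count equals its input-channel count):

```python
matmul_bytes = 8192 * 8192 * 2        # one fp16 input matrix: 134217728 bytes
matmul_flops = 2 * 8192**3

def conv_flops(n):
    hw = matmul_bytes // n            # pixels needed for equal input data
    return 2 * n**2 * hw              # per pixel: n dot products of length n

# N = 4096 gives the same flop count as the 8192 x 8192 matmul:
print(conv_flops(4096) == matmul_flops)  # True
```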

In practice I get an error when trying to create a matrix that big; I had to scale the size down by 100x. The error looks like:

tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=134212225, n=1, k=1
         [[Node: Conv2D = Conv2D[T=DT_HALF, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Variable/read, Variable_1/read)]]

Testing (extending Yaroslav’s code: https://gist.github.com/dimitry12/d8eb165eb9ecd474d6a017156bec3466#file-conv-py-L76-L79):

38,0.6277977837
41,0.7149783805
45,0.8152390789
49,0.9075997542
53,1.0192397357
58,1.0858789485
64,1.5005017117
69,1.3305306010
76,1.4670993722
82,1.4340319776
90,1.5266188081
98,1.6651400846
107,1.9491080286
117,2.1242829384
128,2.9893740279
139,1.9558330697
152,2.9484651222
165,2.3725707415
181,2.5822941898
197,2.9695160932
215,3.1080774282
234,3.2204323179
256,5.3002793901
279,3.8160422806
304,4.7311686984
331,3.7756175425
362,4.2303946644
394,3.9343579247
430,3.8086561374
469,5.0551012201
512,6.7722605583
558,4.7593597729
608,7.7169478252
663,4.5043897079
724,4.3966474121
789,4.4033521839
861,4.3245146979
939,3.8774351501
1024,4.4114756494

Spikes are present, but they are less pronounced and overall flops are much lower (my math is certainly wrong somewhere). At least it confirms for me that conv2d in nvcr.io/nvidia/tensorflow:17.12 does use Tensor Cores.

Interestingly, performance degrades as the number of channels increases (and HxW correspondingly decreases).

Really, with a 1x1 kernel, conv2d is not even a matrix-matrix multiply per pixel, but a vector-matrix multiply - I am surprised the spikes (a tell-tale of Tensor Cores) even show up.
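That said, batched over all HxW pixels, the 1x1 conv is equivalent to one full matrix-matrix multiply, which may be why Tensor Cores can engage at all. A numpy sketch of the equivalence (hypothetical small sizes):

```python
import numpy as np

h, w, n = 4, 4, 8
x = np.random.rand(h, w, n).astype(np.float32)   # NHWC image, n channels
k = np.random.rand(n, n).astype(np.float32)      # 1x1 kernel: n in, n out

conv_out = np.einsum('hwi,io->hwo', x, k)        # per-pixel vector-matrix
matmul_out = (x.reshape(h * w, n) @ k).reshape(h, w, n)

print(np.allclose(conv_out, matmul_out))         # True
```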