Pascal: CUDA 8.0 RC + cuDNN 5.1 unexpectedly slow

I’ve been doing some Caffe CNN training with my old 960 GPU and have now upgraded to a 1070. To measure training performance, I created a small test CNN with 3x3 kernels, because cuDNN 5.1 claims a 2.7x performance improvement for 3x3 kernels. Here are my results:

Visual Studio 2013
CUDA 7.5, cuDNN 4.0, compute_52,sm_52
GeForce 960: 500 iterations ca. 20 seconds

Visual Studio 2015
CUDA 8.0 RC, cuDNN 5.1, compute_61,sm_61
GeForce 1070: 500 iterations ca. 13 seconds

To be honest, that’s quite underwhelming. My thinking was that the GPU upgrade by itself should already double performance, and that CUDA 8.0 RC + cuDNN 5.1 should add another 2.7x boost on top. So I was expecting an overall speedup of about 5x (2 x 2.7 = 5.4). Instead, training time only decreased from 20 seconds to 13 seconds, a speedup of roughly 1.5x.
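To put numbers on that gap, here is a quick back-of-the-envelope calculation. The timings are the ones measured above. The 2x hardware factor is only my own guess, and the 2.7x is the cuDNN 5.1 claim, so treat the "expected" figures as rough assumptions rather than a benchmark. The variable names are just for illustration.

```python
# Measured timings from the runs above (500 iterations each).
old_seconds = 20.0   # GeForce 960, CUDA 7.5, cuDNN 4.0
new_seconds = 13.0   # GeForce 1070, CUDA 8.0 RC, cuDNN 5.1

# Assumed factors: 2x is my own guess for the GPU upgrade,
# 2.7x is the claimed cuDNN 5.1 gain for 3x3 kernels.
gpu_factor = 2.0
cudnn_factor = 2.7

# If both gains applied fully and multiplied together:
expected_speedup = gpu_factor * cudnn_factor
expected_seconds = old_seconds / expected_speedup

# What was actually measured:
observed_speedup = old_seconds / new_seconds

print(f"expected speedup: {expected_speedup:.1f}x")
print(f"expected time:    {expected_seconds:.1f} s")
print(f"observed speedup: {observed_speedup:.2f}x")
```

So the naive expectation would be around 3.7 seconds for 500 iterations, against the 13 seconds I actually see.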

Any idea what might be wrong, or how to get nearer to the expected performance boost?