Convolution performance question on Volta in a Windows environment

I was testing cudnnConvolutionForward on various GPUs (two of which are shown below), with the goal of seeing the performance difference on the GV100 with and without tensor cores enabled versus Pascal GPUs. To keep the test simple I used a single convolution, and to make the differences clearer I ran it on an 8k image. When comparing times, the Titan X and P5000 were faster than the GV100.

I tried to look closer with Visual Profiler’s kernel analysis and ran into an “Internal Profiling error” on the GV100. I then tried profiling in Visual Studio with the Nsight profiler, and it failed to capture any kernels. Are some features still in development for Volta?
I saw that cuDNN uses 128x128 relu on Volta and 128x32 relu on Pascal.

I tested the project in Visual Studio 15 & 17, built with compute_61,sm_61;compute_70,sm_70, using CUDA 9.2 (with patch) and Nsight 5.6.

Trying to profile with the GV100 (convolution time ~31 ms):
[Screenshot: tve, hosted on ImgBB]

Profiling with a Titan X (Pascal), where a Maxwell cuDNN function was used (convolution time ~16 ms):
[Screenshot: txo, hosted on ImgBB]
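
For anyone wanting to reproduce the comparison, here is a minimal sketch of a timing harness, written in Python for brevity. It is not the code used for the numbers above. `op` and `sync` are hypothetical callables standing in for "launch one convolution" and "wait for the GPU to finish". The sketch assumes the measured call is asynchronous on the host, so the clock is only read after synchronizing, and that the first few runs carry one-time setup cost and should be discarded.

```python
import statistics
import time


def time_op_ms(op, sync=lambda: None, warmup=3, iters=10):
    """Return the median wall-clock time of op() in milliseconds.

    op   : hypothetical callable that launches the work (e.g. one convolution).
    sync : hypothetical callable that blocks until the device is idle.
    """
    # Warm-up runs absorb one-time costs such as context creation.
    for _ in range(warmup):
        op()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()
        sync()  # wait for completion before reading the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def slowdown(t_a_ms, t_b_ms):
    """Return how many times slower A is than B."""
    return t_a_ms / t_b_ms
```

With the two timings above, `slowdown(31.0, 16.0)` gives about 1.94, i.e. the GV100 run takes nearly twice as long as the Titan X run.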

I also ran Nvidia’s “conv_sample” from the Linux sample code (I ran it on Windows) and saw similar results.

I’m unsure why the GV100 is slower at convolution than the older cards I have. I was thinking about moving it to a Linux machine to test there, but thought I would post my question before running more tests.

I was able to see and test the expected performance differences with matrix multiplication on Volta, but I haven’t seen sensible performance results for convolution with cuDNN.

Does anyone have insight on any of this?

Thank you

Update (6/27):
Nsight Visual Studio Edition 5.6, which supports Volta, was released on May 31, 2018, about half a year after the first Volta desktop card was released on December 7, 2017. I was using Nsight version 5.6.0.18099.

Experiencing this too…

We have a loss function that runs in 5 s on a Tesla V100 and 2 s on a Titan Xp. This is frustrating.