cuTensor contraction ~5X slower than equivalent cuBLAS sgemm?

Despite having found no thread on this forum mentioning cuTensor that actually got answered, I will try my luck:

I am trying out cuTensor 1.2.0 after having done some work with cuBLAS.
Both are called directly from C++, with no higher-level framework.

I have converted an existing, working cublasGemmEx() call to the equivalent sequence of (tedious) cuTensor calls, following the cutensorContractionSimple sample code (
Everything works and the results are numerically OK.

Tensor layouts are all packed (default strides) and optimal for cublasGemm with no transposes.
The cuTensor descriptions of the tensors are slightly more complicated than the cuBLAS descriptions, as they describe two innermost axes (modes) that cublasGemm sees as a single inner axis. The same is true for the two outermost axes.

Using Nsight Systems to look at kernel run times, cuTensor is almost 5X slower than cuBLAS at performing the exact same computation on the exact same data layouts.
cuBLAS calls “volta_sgemm_128x64_nn”, which takes 17.4 ms
cuTensor launches “contraction_kernel”, which takes 82 ms
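
To put those two timings on a common scale, here is a rough throughput calculation. It is only a sketch: it assumes the usual 2·M·N·K FLOP count for a GEMM and takes the M, N, K values from the cuBLAS view of the problem described further down. The helper names gemmFlops and tflops are made up for this post and are not part of either library.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs of an (M x K) by (K x N) matrix product,
// counting one multiply and one add per accumulated term.
double gemmFlops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * m * n * k;
}

// Achieved throughput in TFLOP/s for a kernel that ran for `ms` milliseconds.
double tflops(double flops, double ms) {
    return flops / (ms * 1e-3) / 1e12;
}
```

With M = 970970, N = 384, K = 151 this gives roughly 6.5 TFLOP/s for the cuBLAS kernel (17.4 ms) and roughly 1.4 TFLOP/s for the cuTensor kernel (82 ms).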

Is this to be expected?

cuTensor is indeed more general than cuBLAS, but I would at least expect cases that easily degenerate into a standard matrix multiplication to be handled roughly equivalently. A cutlass-like degradation of a few percent would be OK, but 5X rules out cuTensor as a usable framework.

Shapes are listed row-major, inner dim on the right.
Odd sizes are intentional, alignment and use of tensor cores is a non-goal here.
Contraction is on outermost dim of A, innermost of B, yielding 1001×970×3×128
cuBLAS sees this as 970970×151 multiplied by 151×384, yielding 970970×384
Strides are default (packed) and dtype is float32 all over, for the most vanilla gemm possible.
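
To make the "two modes seen as one axis" point concrete, here is a small host-side sketch, independent of both libraries, that checks when adjacent modes of a packed tensor can be fused into a single GEMM axis. The helper names packedStrides and canFuse are hypothetical, and the rule used (outer stride equals inner stride times inner extent) is my own statement of the condition, not something taken from the cuTensor documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed (default) strides for a row-major extent list, innermost dim on the right.
std::vector<std::int64_t> packedStrides(const std::vector<std::int64_t>& extents) {
    std::vector<std::int64_t> strides(extents.size(), 1);
    for (int i = static_cast<int>(extents.size()) - 2; i >= 0; --i) {
        strides[i] = strides[i + 1] * extents[i + 1];
    }
    return strides;
}

// Modes `outer` and `outer + 1` can be treated as one axis of extent
// extents[outer] * extents[outer + 1] when stepping the outer mode once
// moves exactly past the whole inner mode.
bool canFuse(const std::vector<std::int64_t>& extents,
             const std::vector<std::int64_t>& strides,
             std::size_t outer) {
    return outer + 1 < extents.size() &&
           strides[outer] == strides[outer + 1] * extents[outer + 1];
}
```

For the 1001×970×3×128 output, the packed strides are 372480, 384, 128, 1, so modes 0 and 1 fuse into 970970 and modes 2 and 3 fuse into 384, which is exactly the 970970×384 matrix that cuBLAS sees.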

Ubuntu 18.04
Tesla T4 GPU, compute capability 7.5
CUDA 10.1/10.2
Driver 440.33.01