cuTENSOR performance for small contraction extents

I’m using cuTENSOR for various contractions. The problem size is the following:

k: 50000
i: 4
j: 2
n: 3
l: 3

I then have contractions of the form:

{k,i,j,l} = {k,j,l,n} x {k,i,n}
{k,j}     = {k,j,l}   x {k,j,l}
{k,i,j}   = {k,i,j,l} x {k,i,j,l}

So the summation is always over the fastest-varying mode, but the extent of that mode is small (always 3). The performance is far from what I would expect: e.g. the contraction {k,i,j} = {k,i,j,l} x {k,i,j,l} takes up to 10 ms (on a 2080 Ti), while a hand-written kernel runs in the order of 10–20 µs. I also tried rewriting this as a batched matrix multiply using cuBLAS, but performance was still poor at ~2 ms. Is it a general problem with cuTENSOR/cuBLAS that small summation extents perform badly? Is there anything I can do about this?
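To make the batched-matrix-multiply reading of the first contraction concrete, here is a NumPy sketch (my own illustration, not from the post; K is shrunk from 50000 to keep it quick). It shows that {k,i,j,l} = {k,j,l,n} x {k,i,n} is, per k, a tiny GEMM of shape (4x3)x(3x6) — which may explain why the batched cuBLAS attempt performed poorly: 50000 GEMMs that small leave the GPU mostly idle.

```python
import numpy as np

# Extents from the post (K shrunk from 50000 for a quick demo).
K, I, J, L, N = 1000, 4, 2, 3, 3

rng = np.random.default_rng(0)
A = rng.standard_normal((K, J, L, N))   # {k,j,l,n}
B = rng.standard_normal((K, I, N))      # {k,i,n}

# Reference: the contraction {k,i,j,l} = {k,j,l,n} x {k,i,n}.
C_ref = np.einsum('kjln,kin->kijl', A, B)

# Batched-GEMM view: for each fixed k, C[i,(j,l)] = B[i,:] @ A[(j,l),:]^T,
# i.e. K independent (I x N)x(N x J*L) = (4x3)x(3x6) matrix products.
A2 = A.reshape(K, J * L, N)                    # (k, jl, n)
C_gemm = np.matmul(B, A2.transpose(0, 2, 1))   # (k, i, jl)
C_gemm = C_gemm.reshape(K, I, J, L)

assert np.allclose(C_ref, C_gemm)
```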

From here:

Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.

Yes, I saw this in the documentation. I understood it as making the mode to sum over the fastest-varying (stride-one) mode. Or does it not matter which mode we sum over, and the stride-one mode should simply have the largest extent in any case?

I’m not a cuTENSOR expert. However, the documentation seems to say that for best performance the mode with the largest extent should also be the stride-1 mode. Your case appears to be exactly the opposite of that. I don’t have any further comments.
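To illustrate what that layout change would look like (my own sketch, assuming the post's convention that the last listed mode is stride-1): keep the same contraction {k,i,j} = {k,i,j,l} x {k,i,j,l}, but store the data so that the large mode k, not the tiny summation mode l, is contiguous. NumPy can't show the GPU speedup, but it verifies that the two layouts compute the same result; on the GPU, the k-contiguous layout would let consecutive threads read consecutive k elements (coalesced) while each thread loops over the small l.

```python
import numpy as np

K, I, J, L = 1000, 4, 2, 3   # K shrunk from 50000 for a quick demo
rng = np.random.default_rng(1)

# Layout from the post: l (extent 3) is the fastest-varying / stride-1 mode.
A_l_fast = rng.standard_normal((K, I, J, L))
B_l_fast = rng.standard_normal((K, I, J, L))
C_ref = np.einsum('kijl,kijl->kij', A_l_fast, B_l_fast)

# Layout the docs suggest: the large mode k becomes stride-1. Same data,
# different memory order; the summation mode l is now an outer (strided) mode.
A_k_fast = np.ascontiguousarray(A_l_fast.transpose(3, 2, 1, 0))  # {l,j,i,k}
B_k_fast = np.ascontiguousarray(B_l_fast.transpose(3, 2, 1, 0))
C_k_fast = np.einsum('ljik,ljik->jik', A_k_fast, B_k_fast)

assert np.allclose(C_ref, C_k_fast.transpose(2, 1, 0))
```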