I’m using cuTENSOR for various contractions. The problem size is the following:

```
k: 50000
i: 4
j: 2
n: 3
l: 3
```

I then have contractions of the form:

```
{k,i,j,l} = {k,j,l,n} x {k,i,n}
{k,j} = {k,j,l} x {k,j,l}
{k,i,j} = {k,i,j,l} x {k,i,j,l}
```
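
To be precise about the semantics: the third contraction is an elementwise product reduced over the fastest-varying mode `l`. A plain CPU reference (my own sketch; the names `A`, `B`, `C` and the row-major layout with `l` innermost are assumptions matching the mode order above):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of {k,i,j} = {k,i,j,l} x {k,i,j,l}:
// multiply elementwise, then sum over l (the innermost, stride-1 mode).
std::vector<float> contract_kij(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t K, std::size_t I,
                                std::size_t J, std::size_t L) {
    std::vector<float> C(K * I * J, 0.0f);
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t i = 0; i < I; ++i)
            for (std::size_t j = 0; j < J; ++j) {
                const std::size_t out = (k * I + i) * J + j;
                float acc = 0.0f;
                for (std::size_t l = 0; l < L; ++l)
                    acc += A[out * L + l] * B[out * L + l];
                C[out] = acc;
            }
    return C;
}
```

With the extents above this is 50000 × 4 × 2 = 400,000 independent length-3 dot products.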

So the summation is always over the fastest-varying mode, but the extent of that mode is small (always 3). The performance is really not what I would expect: e.g. the contraction `{k,i,j} = {k,i,j,l} x {k,i,j,l}` takes up to 10 ms (on a 2080 Ti), while a custom-written kernel runs in the order of 10–20 µs. I also tried rewriting this as a batched matrix multiply using cuBLAS, but performance was still poor at ~2 ms. Is this a general problem with cuTENSOR/cuBLAS, that small “summation dimensions” perform badly? Is there anything I can do about it?
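
For reference, my custom kernel is in the spirit of the following (a minimal illustrative sketch, not the exact code): one thread per output element, with the length-3 reduction fully unrolled. At these sizes the whole contraction reads ~9.6 MB and writes ~1.6 MB, so a kernel like this should be purely memory-bandwidth-bound, which is consistent with the 10–20 µs I measured.

```cuda
// Hypothetical hand-written kernel for
// {k,i,j} = sum_l A[k,i,j,l] * B[k,i,j,l], with L = 3 fixed at compile time.
// One thread per output element; each thread does three FMAs and one store.
__global__ void contract_kij_kernel(const float* __restrict__ A,
                                    const float* __restrict__ B,
                                    float* __restrict__ C,
                                    int n_out)  // n_out = K * I * J = 400000
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_out) return;
    const float* a = A + 3 * idx;  // l is the innermost (stride-1) mode
    const float* b = B + 3 * idx;
    C[idx] = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Launch example:
// contract_kij_kernel<<<(n_out + 255) / 256, 256>>>(d_A, d_B, d_C, n_out);
```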