I’m using cuTENSOR for various contractions. The problem size is the following:
k: 50000
i: 4
j: 2
n: 3
l: 3
I then have contractions of the form:
{k,i,j,l} = {k,j,l,n} x {k,i,n}
{k,j} = {k,j,l} x {k,j,l}
{k,i,j} = {k,i,j,l} x {k,i,j,l}
So summation is always over the fastest varying mode, but the extent of that mode is small (always 3). The performance is far below what I would expect: for example, the contraction {k,i,j} = {k,i,j,l} x {k,i,j,l} takes up to 10 ms (on a 2080 Ti), while a custom-written kernel runs in the order of 10-20 µs. I also tried rewriting this as a batched matrix multiply using cuBLAS, but performance was still poor at ~2 ms. Is this a general problem with cuTENSOR/cuBLAS, that small "summation dimensions" perform badly? Is there anything I can do about it?
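For concreteness, the last contraction is just a length-3 dot product per (k,i,j) triple. A minimal CPU reference sketch of what I expect it to compute (the function name and the row-major layout with l fastest are my own assumptions, not cuTENSOR API):

```cpp
#include <cstddef>
#include <vector>

// Reference for {k,i,j} = {k,i,j,l} x {k,i,j,l}: elementwise product
// reduced over the fastest-varying mode l (extent 3 in my case).
// Assumed layout: row-major with l fastest, A[((k*I + i)*J + j)*L + l].
std::vector<float> contract_kijl(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int K, int I, int J, int L) {
    std::vector<float> C(static_cast<std::size_t>(K) * I * J, 0.0f);
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < I; ++i)
            for (int j = 0; j < J; ++j) {
                const std::size_t base =
                    ((static_cast<std::size_t>(k) * I + i) * J + j) * L;
                float acc = 0.0f;
                for (int l = 0; l < L; ++l)  // tiny reduction, L == 3
                    acc += A[base + l] * B[base + l];
                C[(static_cast<std::size_t>(k) * I + i) * J + j] = acc;
            }
    return C;
}
```

This is purely memory-bound for my extents, which is why I would expect runtimes near the 10-20 µs of the hand-written kernel rather than milliseconds.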