Hello,

I am trying to design a kernel to execute the following operation:

`A(n,m,l) = sum_{i,j,k} B(i,j,k)*C(i,n)*C(j,m)*C(k,l)`

where B is a symmetric sparse tensor (99.9% sparse) and C is a dense matrix. Because of the highly specific nature of this problem (symmetric, sparse, and a tensor), I have to write my own kernel. My first approach (which I can get working) uses one kernel to compute each element of A. This requires a kernel launch per element of A, but more importantly it forces me to access the matrix C completely out of order because of the sparsity pattern of B. This kernel ends up using 90% of memory bandwidth, with ~75% of memory accesses uncoalesced, so it feels like there is performance to be gained.
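
For concreteness, the dense version of the contraction can be checked on the CPU. Here is a NumPy sketch (sizes and names are tiny and purely illustrative, not from my actual problem):

```python
import numpy as np

# Dense CPU reference for
#   A[n, m, l] = sum_{i,j,k} B[i,j,k] * C[i,n] * C[j,m] * C[k,l]
# (sizes here are tiny and purely illustrative).
rng = np.random.default_rng(0)
S = 4                      # contracted dimension (i, j, k)
T = 3                      # output dimension (n, m, l)
B = rng.random((S, S, S))
C = rng.random((S, T))

A = np.einsum('ijk,in,jm,kl->nml', B, C, C, C)

# Sanity check against explicit loops
A_loop = np.zeros((T, T, T))
for n in range(T):
    for m in range(T):
        for l in range(T):
            A_loop[n, m, l] = np.sum(
                B
                * C[:, n][:, None, None]
                * C[:, m][None, :, None]
                * C[:, l][None, None, :]
            )

print(np.allclose(A, A_loop))  # should print True
```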

- Currently B is stored in COO format.

```
function kernel1(B_sparse_COO, C, n, m, l)
    # reduce over the stored entries of B to produce the single element A[n, m, l]
    f = (coo_data) -> coo_data.val * C[coo_data.i, n] * C[coo_data.j, m] * C[coo_data.k, l]
    return mapreduce(f, +, B_sparse_COO)
end
```
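
On the CPU, the same per-element reduction over the COO entries looks like this (Python/NumPy, purely as a correctness check; `element` and all the sizes are illustrative names of my own):

```python
import numpy as np

# CPU analogue of kernel1: one reduction over the stored (nonzero)
# entries of B per output element A[n, m, l]. All names/sizes are illustrative.
def element(I, J, K, V, C, n, m, l):
    return np.sum(V * C[I, n] * C[J, m] * C[K, l])

rng = np.random.default_rng(1)
S, T = 4, 3
flat = rng.choice(S**3, size=5, replace=False)   # 5 distinct nonzero positions
I, J, K = np.unravel_index(flat, (S, S, S))
V = rng.random(5)
C = rng.random((S, T))

# Dense reference built from the same COO data
B = np.zeros((S, S, S))
B[I, J, K] = V
A_ref = np.einsum('ijk,in,jm,kl->nml', B, C, C, C)

ok = all(
    np.isclose(element(I, J, K, V, C, n, m, l), A_ref[n, m, l])
    for n in range(T) for m in range(T) for l in range(T)
)
print(ok)  # should print True
```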

My only other idea is outlined in the pseudo-code below. Here I instead accumulate all of the elements of A at the same time. I believe this would allow me to coalesce memory accesses better, because the random indices i, j, k are now fixed within each launch. It still requires a kernel launch for every stored element of B, which I'm not a huge fan of. Any advice would be greatly appreciated! Thanks.

```
function kernel2!(A, C, p, I, J, K, V)
    for n <= m <= l   # each (n, m, l) is its own thread
        A[n, m, l] += V[p] * C[I[p], n] * C[J[p], m] * C[K[p], l]
    end
end

(I, J, K, V) = find_nonzero(B)   # extract the COO indices/values once, outside the loop
for p = 1:length(V)              # one kernel launch per stored entry of B
    kernel2!(A, C, p, I, J, K, V)
end
```

For the second kernel, I'm in the middle of implementing it, but I'm a little confused about how to launch a kernel whose thread ids cover only the upper-triangular part of the tensor (n ≤ m ≤ l), to avoid double counting symmetric elements.
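
What I have in mind is something like the following unranking of a linear thread id into a triple n ≤ m ≤ l, sketched here on the CPU in Python (0-based just for illustration; a CUDA/Julia version would be 1-based and run per thread — this loop-based mapping is only one possible approach):

```python
from math import comb

# Unrank a linear thread id t into the unique triple (n, m, l) with
# 0 <= n <= m <= l < N.  There are comb(N + 2, 3) such triples, so a
# 1-D launch of that many threads covers the upper triangle exactly once.
def unrank(t, N):
    for n in range(N):
        rem = N - n
        block = rem * (rem + 1) // 2   # number of pairs m <= l with m >= n
        if t < block:
            break
        t -= block
    for m in range(n, N):
        if t < N - m:
            return (n, m, m + t)
        t -= N - m

N = 5
triples = [unrank(t, N) for t in range(comb(N + 2, 3))]
expected = [(n, m, l) for n in range(N) for m in range(n, N) for l in range(m, N)]
print(triples == expected)  # should print True
```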