I am trying to design a kernel to execute the following operation:
A(n,m,l) = B(i,j,k) * C(i,n) * C(j,m) * C(k,l)   (summed over i, j, k)
where B is a symmetric sparse tensor (99.9% sparse) and C is a dense matrix. Because the problem is so specific (symmetric, sparse, and a tensor contraction), I have to write my own kernel. The first approach (which I can get working) computes each element of A with its own kernel launch. Besides the launch overhead, the sparsity of B forces me to access the matrix C completely out of order: this kernel ends up using 90% of memory bandwidth, with ~75% of memory accesses uncoalesced. It feels like there is performance to be gained.
- Currently B is stored in COO format (index arrays `i, j, k` plus values).
```julia
function kernel1(B_sparse_COO, C, n, m, l)
    f = coo_data -> coo_data.val * C[coo_data.i, n] * C[coo_data.j, m] * C[coo_data.k, l]
    return mapreduce(f, +, B_sparse_COO)
end
```
My only other idea is outlined in the pseudo-code below. Here I instead accumulate all of the elements of A at the same time. I believe this would let me coalesce memory accesses better, because within each launch the random indices i, j, k are fixed. However, it still requires a kernel launch for every nonzero of B, which I'm not a huge fan of. Any advice would be greatly appreciated! Thanks.
```julia
# pseudo-code: each (n, m, l) with n <= m <= l is its own thread
function kernel2(A, C, p, I, J, K, V)
    for n <= m <= l
        A[n, m, l] += V[p] * C[I[p], n] * C[J[p], m] * C[K[p], l]
    end
end

(I, J, K, V) = find_nonzero(B)   # hoisted out of the loop
for p = 1:length(V)
    kernel2(A, C, p, I, J, K, V)
end
```
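One way to avoid a launch per nonzero entirely is to invert the nesting: launch a single kernel over the (n, m, l) threads and let each thread loop over all nonzeros of B, accumulating into a register before one write to A. Here is a plain-Julia CPU sketch of that access pattern (the name `kernel3!` and the loop structure are my own, not from the post; a real GPU version would replace the outer loops with thread indices):

```julia
# Single pass: each (n, m, l) "thread" accumulates over every nonzero of B,
# so each element of A is written exactly once and rows of C are reused
# across all p for a fixed (n, m, l).
function kernel3!(A, C, I, J, K, V)
    N = size(C, 2)
    for l in 1:N, m in 1:l, n in 1:m      # upper "tetrahedron" only
        acc = zero(eltype(A))
        for p in eachindex(V)
            acc += V[p] * C[I[p], n] * C[J[p], m] * C[K[p], l]
        end
        A[n, m, l] = acc
    end
    return A
end
```

On the GPU this means no atomics and exactly one coalesced store per element of A; the cost is that every thread reads every nonzero, which is why it pays off mainly when B's nonzero arrays fit in cache or shared memory.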
For the second kernel, I'm in the process of implementing it, but I'm a little confused about how to launch a kernel with thread IDs that cover only the "upper triangle" (n <= m <= l) of the tensor, to avoid double-counting symmetric elements.
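A common trick for triangular iteration spaces is to launch a 1-D grid of `binomial(N+2, 3)` threads (the number of triples with n <= m <= l <= N) and map each linear thread index to its unique triple. A CPU sketch of that mapping (helper names `ntriples` and `triple_from_linear` are mine; on a GPU the linear searches below could be replaced with closed-form cube/square-root formulas to avoid divergence):

```julia
# Number of triples (n, m, l) with 1 <= n <= m <= l <= N
ntriples(N) = binomial(N + 2, 3)

# Map a linear index t in 1:ntriples(N) to the triple (n, m, l).
# Triples with largest index l occupy ranks binomial(l+1,3)+1 : binomial(l+2,3),
# so we first locate l, then m, then n within the remainder.
function triple_from_linear(t)
    l = 1
    while binomial(l + 2, 3) < t
        l += 1
    end
    r = t - binomial(l + 1, 3)        # rank within the slab of fixed l
    m = 1
    while m * (m + 1) ÷ 2 < r
        m += 1
    end
    n = r - m * (m - 1) ÷ 2
    return (n, m, l)
end
```

Inside the kernel, each thread would call this on its global linear ID and work on exactly one (n, m, l), so the symmetric elements are never visited twice.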