Help with Kernel for Multi Sparse Tensor time Matrix (Multi-TTM)


I am trying to design a kernel to execute the following operation:
A(n,m,l) = B(i,j,k)*C(i,n)*C(j,m)*C(l,k)
where B is a symmetric sparse tensor (99.9% sparse) and C is a dense matrix. Due to the highly specific nature of this problem (symmetric, sparse & tensor) I have to write my own kernel. The first approach (which I can get working) uses one kernel to calculate every element of the tensor A. This requires a kernel launch for each element of A but more importantly forces me to access the matrix C completely out of order due to the sparsity of B. This kernel ends up using 90% of memory bandwidth and doing ~75% of memory accesses uncoalesced. It feels like there is performance to be gained.

  • Currently B is stored as a COO matrix.
function kernel1(B_sparse_COO, C, n, m, l)
    f = (coo_data) -> coo_data.val * C[coo_data.i,n] * C[coo_data.j,m] * C[coo_data.k,l]
   return mapreduce(f, +, B_sparse_COO)

My only other idea is outlined in the pseudo-code below. Here instead I accumulate all of the values in A at the same time. I believe this would allow me to coalesce memory accesses better because the random indicies i,j,k are now fixed. This still requires a kernel launch for every element of B which I’m not a huge fan of. Any advice would be greatly appreciated! Thanks.

function kernel2(A,B,C,p,I,J,K,V)
     for n <= m <= l #each n,m,l is its own thread
        A[n, m, l] += B[p] * C[I[p], n] * C[J[p], m] * C[K[p], l]

for p = 1:length(B)
    (I, J, K, V) = find_nonzero(B)

For the second kernel, I’m in the process of implementing but a little confused how to launch a kernel with thread-ids that go through only the upper triangle of the tensor (to avoid double counting symmetrical elements).

I think the second kernel is on the right track, or at least what I would aim for, as an initial baseline for further benchmark comparison.

I guess I don’t understand the concern. If the calculations were symmetric, I could see wanting to cut things in half. But if B is symmetric and C is not, I’m not sure how the calculations are symmetric. Are you saying you want to do this:

   A[n, m, l] += B[p] * (C[I[p], n] * C[J[p], m] * C[K[p], l] + C[I[ps], n] * C[J[ps], m] * C[K[ps], l])

where ps is the “symmetric index” corresponding to the element B[p] (i.e. B[p] = B[ps])

If so that strikes me as a relatively small concern or optimization, not something I would worry too much about for a first level implementation. If this is a repetitive calc, and the B matrix is unchanging, then you could consider rearranging the nz values of B so that they are organized symmetrically, that is, make sure all the upper triangular values appear first. You then know the lower triangular values, both value and coordinate, without having to retrieve them.

1 Like

The matrix B is symmetric but the output A is symmetric as well so I only want to calculate the unique part of A hence why I want to loop over for n <= m <= l when accumulating values into A. I’ll try and re-word this in a simpler problem that has the same core issue.

If I have two upper triangular matrices D & E with everything else 0 and I want to add them I only really need to loop through the upper triangle of these matrices. The first set of code below is the sort of obvious way around this but would then lead to launching a bunch of threads that do nothing or even divergence.

The second would launch one thread per matrix element and ignore the zero elements; however, it is unclear to me how to recover 2D indices into the matrix from the thread/block ids. There might be some clever formula but I need this to generalize to 3D and 4D. I can pre-compute all of the 2D/3D/4D indices and put them in a vector before I launch the kernel but that doubles my memory needs as I know have a tensor of elements and a vector of indices that is the same size as that tensor. Is there some programming pattern people use to get around this or is an if statement best? Thanks!

What will work but leaves threads idle & cause divergence:

i = thread Idx x equation
j = thread Idx y equation
if ( i <= j)
     out[n,m] = D[n,m] + E[n,m]

What I would like to do.

i = thread Idx equation
n = ??? #some function of i
m =  ???
out[n,m] = D[n,m] + E[n,m]

I see. A 2D linear-to-triangular conversion can be done with relatively simple non-recursive math (step 7 there). Here is another example.

Agreed. The 3D case didn’t lend itself to an easy calculation for me. You can do a recursive calculation, of course, but most folks probably don’t want to do that.

You could use the 2D methodology to cut the space in half, then allowing for additional waste/unused threads.

1 Like

To go to 3D case, we could use the formula for tetrahedral numbers to express an equation for the tetrahedral number n corresponding to a given linear index i, as

n3 + 3n2 + 2n - 6i = 0

we could use the cubic formula to get a closed form solution for finding n for a given i. I assume of the 3 roots only one would be real and positive.

The nearest integer larger than or equal would then give us our z coordinate from the linear index i. From there we would know the dimensions of the corresponding slice in z, and could use the closed form solution previously suggested for computing the triangular coordinates from the remainder of the linear index i and the tetrahedral number associated with z.

Cool, I learned something. I had never come across tetrahedral numbers before. Assuming the cubic equation above is correct, Wolfram Alpha indicates that one can compute the real root using

float t = sqrtf (729.0f * i * i - 3.0f) + 27.0f * i;
float n = cbrtf (t / 9.0f) + 1.0f / cbrtf (3.0f * t) - 1.0f;

Quick test using float computation:

i= 1 n=1.00000000 n**3+3n**2+2n-6i= 0.000000
i= 2 n=1.43484151 n**3+3n**2+2n-6i= 0.000002
i= 3 n=1.74783683 n**3+3n**2+2n-6i= 0.000002
i= 4 n=2.00000000 n**3+3n**2+2n-6i= 0.000000
i= 5 n=2.21446800 n**3+3n**2+2n-6i= 0.000002
i= 6 n=2.40284801 n**3+3n**2+2n-6i= 0.000004
i= 7 n=2.57189775 n**3+3n**2+2n-6i=-0.000004
i= 8 n=2.72594213 n**3+3n**2+2n-6i=-0.000011
i= 9 n=2.86793637 n**3+3n**2+2n-6i=-0.000011
i=10 n=3.00000000 n**3+3n**2+2n-6i= 0.000000
i=11 n=3.12371182 n**3+3n**2+2n-6i= 0.000000
i=12 n=3.24028277 n**3+3n**2+2n-6i=-0.000008
i=13 n=3.35066581 n**3+3n**2+2n-6i= 0.000008
i=14 n=3.45562339 n**3+3n**2+2n-6i=-0.000008
i=15 n=3.55577970 n**3+3n**2+2n-6i= 0.000008
i=16 n=3.65164924 n**3+3n**2+2n-6i= 0.000000
i=17 n=3.74366522 n**3+3n**2+2n-6i= 0.000000
i=18 n=3.83219409 n**3+3n**2+2n-6i= 0.000015
i=19 n=3.91754866 n**3+3n**2+2n-6i= 0.000008
i=20 n=3.99999976 n**3+3n**2+2n-6i=-0.000015
1 Like

The first 4 tetrahedral numbers are 1, 4, 10, 20, so that seems correct as far as it goes. The first linear index (1) would correspond to the first z-slice (slice dimensions 1x1), which has only one element. The next 3 linear indices (2-4) would correspond to the second z-slice, that has 3 elements in it (slice dimensions 2x2, triangular), the next 6 linear indices (5-10) would correspond to the third z-slice, that has 6 elements in it (slice dimensions 3x3, triangular).

for example I would expect the i=20 output to be n=4.0

Now that I think about it for large i, the solution method might have enough inaccuracy that eventually you would have a calculated value end up slightly higher than it should be. If this happened at a slice breakpoint, you could incorrectly calculate that (e.g. for the above example if the solution calculated n=4.00000001).

I don’t know how to resolve that. I would assume that eventually double precision calculations would run into problems for large enough i. Perhaps you could do the reverse calculation to “check your work” using integers, to figure out if an off-by-one error had occurred.

The integer calculation method for the 2D triangular case doesn’t suffer from this particular issue.

1 Like

Super cool, it looks the like the tetrahedral numbers generalize to higher orders as well. This could be super useful if I can find away around the error. The main problem I see is that my tensors will ultimately have millions of elements so the error will stack up.

I guess the main question is: Does it require more work to convert the 1D index into a proper 2D/3D/4D index or does it require more work to launch a bunch of warps that do nothing? For example with this code if I just launch a normal 2D grid of thread blocks (like below) will this cause divergence? At a minimum I there will be threads doing nothing. When I eventually get to the 4D case only 1/24th of the tensor will actually need to be looped.

function kernel()
    i = ThreadIdx.x
    j = ThreadIdx.y
    if (i <= j)
          # Do stuff

A super ideal solution would be to somehow build an iterator/generator for these indices that is indexable, but I don’t know enough about GPU programming to know if you can use those on the GPU.

Probably the best answer is to code up a comparison and do benchmarking. My general opinion is that launching threads that do nothing is not very expensive, and furthermore the floating point calculations that we would be doing on every thread for the 3D case don’t strike me as trivial, but comparing that against a ~8x thread overload is not something I’m sure about without testing. The concerns about error shouldn’t preclude getting useful performance results from a test case.

I realized its easier to just start with the tensor unrolled so that it takes 1D indices by default. Then inside the kernel I can convert the 3D index into a 1D index. After the fact I can rebuild the 3D tensor using the tetrahedral numbers which are much easier to generate if you already know what row you’re on!

Thanks for the help. This tetrahedral number thing is actually super neat. And super useful that it generalizes to any dimension.

My kernel in pseudocode would basically be:

function kernel(value, A, B, arraySize){
    i = #index eqn
    j = #index eqn
    k = #index eqn

    idx = to_1D_index(i, j, k, arraySize)

    #check that idx < length(A) without if/else somehow (use min/max probably)

    A[idx] += value * B[i] * B[j] * B[k]

I’m glad you found a simpler method. Because I had started to flesh out in code what I had described (and using the equations supplied by njuffa/wolfram) and immediately ran into floating point trouble. My next step was to use the floating point version to land somewhere on the number line, then test the integers on either side of that point to determine if either was a slice breakpoint. It was getting messier than I hoped, quicker than I hoped. So although the floating point method seems interesting, in practice it is troublesome, based on my investigations so far.

Yeah the original approach basically boils down to root finding on non-quadratic polynomials which is its own entire mess.

Thanks for all your help with this. Always useful to talk to other people and learn something new. Still a lot of work to do optimizing the kernel but I’ve got the blueprints now. Not really sure what to mark as the solution since its kinda split across the thread. I’ll open a new thread if I have optimization questions.

I took the liberty of picking a solution comment.

I realized what I originally proposed in the solution to was a little flawed but on the right track. The actual solution to loop the only the lower triangle of a symmetric matrix is as follows. Given a symmetric matrix A take the lower triangular part and flatten it (the values here aren’t actually used in the calculation):

A = [1 2 3;
     2 1 8;
     3 8 5]

A_lower_flat = [1, 2, 1, 3, 8, 5]
The new caveat is that you also need to store the row information of each element:
rows = [1, 2, 2, 3, 3, 3]
This row combined with a 1D index from a CUDA kernel can be used to calculate the column of the element in the 2D matrix like so:
col = 1D_idx - triangular_numbers(row - 1)
and boom you now have the 2D index from the 1D index. The extra storage for the row information is unfortunate but is manageable I think. This idea should also generalize to 3D where you will use the tetrahedral numbers and have to store the row & column to calculate the 3rd index.

CUDA Kernel Pseudocode:

function kernel(K, B, rows, val){
   i = #1D index from CUDA blocks/grid/thread
   col = i - tri_num(row[i] - 1)
   K[i] += val * B[row] * B[col]

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.