symmetric matrix addressing - not quite CUDA, rather an algorithm question

Maybe it’s too late, maybe it’s something with my brain, but I can’t find
a solution to my problem.

If anyone could help me I’d be very thankful!

I’ve got a symmetric matrix like the one below and I need to be able to get to
the right value using two addressing schemes: [row, column] and a flat
sequence index, like this (both row/column and sequence index shown):

         0    1    2    3    4   [column]
    0    0    1    2    3    4
    1         5    6    7    8
    2              9   10   11
    3                  12   13
    4                       14
  [row]

(drawing in ASCII sucks ;)

The matrix dimension is 5 in the example above.
The cells contain the indices of the elements (the values stored in the matrix
don’t matter in this context); numbering is zero-based.
row0, col0 is index 0
row0, col1 is index 1

row2,col3 is index 10 and so on.

Given row and col it’s easy to find the index -
for this matrix of dimension 5 the equation is:

(row*(9-row))/2 + col

So, for example, inserting row = 2, col = 3 into the equation above gives us:
(2*(9-2))/2+3 = 14/2 + 3 = 10, which is the right index.
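
For a general dimension n, the constant 9 in the equation above is 2*n - 1. A minimal C sketch of the forward mapping, assuming row <= col (upper triangle stored row by row); the function name tri_index is made up for illustration:

```c
/* Packed index of element (row, col) of an n x n symmetric matrix whose
   upper triangle is stored row by row. Assumes 0 <= row <= col < n. */
int tri_index(int n, int row, int col)
{
    /* row * (2n - 1 - row) is always even, so the division is exact. */
    return row * (2 * n - 1 - row) / 2 + col;
}
```

With n = 5, tri_index(5, 2, 3) gives 10 and tri_index(5, 1, 4) gives 8, matching the examples in the post.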

But… I need the opposite and I cannot find the formula :( :(
Given an index I need to find col and row:
for index 10, for example, the correct answer would be row 2 and col 3.

Why I need that?
I’m given symmetric matrices stored in a flat array like this
(for the example above, the array has length 15):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

  • I know the matrix dimension (in this case: 5).
    In my program every thread is assigned an index and has to evaluate
    col and row.
    Each thread only knows the matrix dim and its index, so for example:
    thread x receives index 8 and knows that matdim is 5 - it has to
    output row = 1 and col = 4 (again, look at the picture above) - that’s all!
    The trick is that it has to work without thread divergence: no ifs and no loops (or at least the same ones for every thread). It
    has to be a simple formula like the one for row,col into index above, but
    the opposite:
    index -> row, col

Address translation from row,col into index is trivial;
the opposite is like a brick wall for me now :/
(it should be simple since the translations are unique in both directions)

Please help - thanks a lot!

That the translations are unique only means it is solvable, not that it is simple; otherwise there would be no public key cryptography.

Nevertheless the formula to get the row is simple enough:

row = n - (1 + sqrt(4*n*(n+1) - 7 - 8*index)) / 2

Where n is the number of rows/columns of the matrix.

With the formula written like this you can just round down to the nearest integer in both the sqrt and the / 2 to get the proper integer result; if you implement the sqrt as a precalculated table you can use integer-only arithmetic. If you don’t mind using floating point (the precision should be good enough as long as your matrix does not get larger than a few thousand rows/columns) you can of course avoid one division by pulling the / 2 into the sqrt.
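
For reference, one way to arrive at such a row formula is to count from the end of the array, where the rows have lengths 1, 2, 3, ... The symbols N (number of stored elements), j (position counted from the end) and k (row counted from the bottom) are introduced here only for this sketch; k is the largest integer with k(k+1)/2 <= j:

```latex
\begin{align*}
N &= \frac{n(n+1)}{2}, \qquad j = N - 1 - \mathrm{index} \\
\frac{k(k+1)}{2} \le j < \frac{(k+1)(k+2)}{2}
  &\;\Longrightarrow\; k = \left\lfloor \frac{\sqrt{8j+1} - 1}{2} \right\rfloor \\
8j + 1 &= 4n(n+1) - 7 - 8\,\mathrm{index} \\
\mathrm{row} &= n - 1 - k
  = n - \left\lfloor \frac{1 + \sqrt{4n(n+1) - 7 - 8\,\mathrm{index}}}{2} \right\rfloor
\end{align*}
```

For n = 5 and index = 10 this gives sqrt(33) ≈ 5.74, floor(6.74 / 2) = 3 and row = 5 - 3 = 2.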

The formula to get the column given the row and index should be simple enough to find yourself ;-)

right :smile:

The fact is, it is simple - thx! I was pursuing one without all that floating point hassle, so I didn’t even look in the quadratic formula direction. Now I see there is no chance to get it with integers only (or is there?). :sad:

thx again

ps. col= index - row*(2*n-1-row)/2 :wink:

As I said, as I wrote it, the formula works perfectly fine with integer-only operations.

The only problem is the sqrt: GPU hardware (like most other hardware) does not provide an integer-only one.

You can either get it by using the floating-point one and rounding down, or you can create your own one (e.g. table-based or mixed calculation and table).

If the LGPL license is acceptable to you, FFmpeg’s libavutil/internal.h contains such code, though you might end up with either some divergent branches or rather slow code involving shifts, divisions etc. - I don’t see why that would be much of a problem though.

Otherwise older literature, particularly about game programming, should contain lots of hints about doing fast integer sqrt etc.


Following your suggestion I googled a bit and found a few integer sqrt implementations - most of them full of branches and loops, so I didn’t even bother to start studying them. The only one that is “clean” enough is this:

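#include <stdint.h>
#include <string.h>

int isqrt (long r) {
     float tempf, x, y, rr;
     uint32_t bits;
     int is;

     rr = (float) r;                      /* the input as a float */
     y = rr*0.5;

     /* bit trick: initial guess for 1/sqrt(rr); memcpy replaces the
        pointer casts, which break when long is 64 bits wide */
     memcpy(&bits, &rr, sizeof bits);
     bits = (0xbe6f0000u - bits) >> 1;
     memcpy(&tempf, &bits, sizeof tempf);
     x = tempf;

     x = (1.5*x) - (x*x)*(x*y);           /* Newton step for 1/sqrt */
     if (r > 101123) x = (1.5*x) - (x*x)*(x*y);   /* second step for large r */

     is = (int) (x*rr + 0.5);             /* sqrt(r) = r * (1/sqrt(r)), rounded */

     /* parentheses added: + binds tighter than >>, so the line as posted
        shifted the whole sum; this subtracts 1 if is*is > r */
     return is + (((signed int) (r - (long) is*is)) >> 31);
}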

It has one branch that, maybe, will be predicated (I’m not sure, I don’t quite get branch predication on G80).

Also it contains so many muls that IMHO it is unlikely to beat the standard sqrt (32 cycles according to the NVIDIA Programming Guide).

I’d rather stay with NVIDIA’s float sqrt implementation.

Could you please help me with one last thing:

the formula for finding the row given an index:

row = n - floor( (1 + sqrt(4*n*(n+1) - 7 - 8*index)) / 2 );

I’ll replace the integer under the sqrt with x:

row = n - floor( (1 + sqrt(x)) / 2 );

where row, n and x are integers.

How should I write it (I have casts/conversions in mind) so that it is numerically correct and fast?

No explicit casts/conversions:

int row = n - floor( (1 + sqrt(x)) / 2 );

Float numbers:

int row = n - floor( (1.0f + sqrt(x)) / 2.0f );

x conversion to float:

int row = n - floor( (1.0f + sqrt(  __int2float_[rn,rz,ru and rd](x) )) / 2.0f );

(btw. why would one need four rounding modes when converting int to float: rn, rz, ru and rd? I don’t get it - an int has no fractional part, so what is there to round?)

And should I use sqrt or sqrtf?

Thanks a lot - The SDK and Programming Guide are pretty sketchy on that topic.

That certainly is not an integer sqrt as I meant it; it is full of floats and IMO generally horrible. Probably none of the implementations that do not use a table make any sense.

The normal sqrt will do just fine as long as it is precise enough (it has no chance of working reliably with matrices larger than about 2000 columns). If it is not, you can still improve the precision by applying the classical update (where sqr is a float variable and val an int) sqr = (sqr + val/(int)(sqr)) / 2; a few times, which should get you up to matrices of about 20000 columns - but you will have to test that.

Note that it might be faster to do x >> 1 instead of x / 2, I do not know that (if you use unsigned instead of int the compiler might be able to do this conversion for you, though I doubt it).

If you already use floats, you can normally take advantage of the fact that x = constant - 8*index and avoid the division by doing floor(0.5 + sqrt(constant2 - 2*index)), where constant2 is 1/4 of constant. You can then no longer use an integer sqrt, though, and improving the precision to work with larger matrices will be more complex.

If you cannot do that kind of optimization, you can still replace / 2.0 by * 0.5 - in case the compiler is not bright enough to do it by itself (which is quite likely since such a change is not always correct).

What does that have to do with fractional parts? Rounding is about inexact representations, and not all ints can be represented exactly by a float. Though if that case happens, that code will not work anyway.
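
To make that concrete: a 32-bit float has a 24-bit significand, so above 2^24 = 16777216 only some integers have an exact float value. A small C sketch (the helper name fits_in_float is made up; it assumes |v| stays well below 2^31 so the cast back to int is safe):

```c
/* Returns 1 if the int survives a round trip through float unchanged. */
int fits_in_float(int v)
{
    volatile float f = (float) v;   /* volatile: force a real 32-bit float */
    return (int) f == v;
}
```

fits_in_float(16777216) is 1, while fits_in_float(16777217) is 0: that value lies between two floats, and this is exactly where the rounding mode decides which neighbour is chosen.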