Problem with fast matrix multiplication

Hey guys, I am trying to implement the “fast” matrix multiplication described in this PowerPoint presentation:

http://coachk.cs.ucf.edu/courses/CDA6938/s08/matrix_multiplication_hongliang.ppt

The slides include a chunk of code that I am trying to implement, but at one point they use:

comp16(b, &ashare[k][0], c)

and I have no idea what it does. I tried googling for it, but with no luck. Can anyone shed any light on it? And if anyone has implemented this method for matrix multiplication, is it actually faster?

I am struggling big time, so I hope you can help me out a bit.

Thanks again.

This possibly originates from my matrix multiply code published here: http://forums.nvidia.com/index.php?showtopic=47689&st=40&p=314014&#entry314014 Check the calls to saxpy() there.

That code was optimized for the G80 and was faster than CUBLAS at the time. I don’t think it is faster on Fermi or Kepler.