Hi all. CUDA beginner here, so please excuse me if I’m missing anything obvious :)

I wrote a device function that calculates a covariance matrix from discrete values. In a loop, I sum the products of the values for each pair of input variables. In the same loop, I also accumulate the sum of each variable, so that at the end I can calculate the final covariance using the standard formula C(X,Y) = E(X*Y) - avg(X)*avg(Y). When that’s finished, I perform a standard matrix inversion. No problem so far.

I then made a copy of this function that performs the same operations, except that I removed the variable summations and the second term of the covariance calculation, leaving C(X,Y) = E(X*Y). The two functions are identical apart from the extra operations performed by the first; that is, I took no shortcuts, changed no math, and changed no memory accesses (other than removing the structure used to hold the sums). Both use the same matrix inversion function.

And yet, the first function – the one that performs more math – actually runs faster than the second function, which is the first function with some steps stripped away.

I can’t see any reason for this. I’m not accessing global memory in this function, just a few local memory arrays and some register variables. The second function compiles to fewer registers than the first. I am confident my timing mechanism is accurate (I’m using the timer from cutil). I tried adjusting the size of my input to see whether the speed depended on the size of the values going into the inversion step, but that made no difference either.

So, now I’m out of ideas. Any advice on what I could investigate next? I’m guessing the compiler must be doing something for me, but I don’t see why it would optimize the first implementation and leave the second slower.

Thanks for your help!