Strange performance behavior: function that does more math runs faster

Hi all. CUDA beginner aboard, so please excuse me for missing anything obvious :)

I wrote a device function which calculates a covariance matrix from discrete values; in a loop, I sum the products of the values from each pair of input variables. In the same loop, I accumulate the sum of each variable so that I can calculate the final covariance at the end using the standard formula of C(X,Y) = E(X*Y)-(avg(X)*avg(Y)). When I’m finished, I perform a standard matrix inversion. No problem so far.

I then made a copy of this function that performs the same operations, except I removed the variable summations and the second term from the covariance calculation, leaving C(X,Y) = E(X*Y). The functions are exactly the same except for the extra operations performed by the first function; that is, I took no shortcuts, changed no math, and changed no memory accesses (except for removing the structure used to hold the sums). I'm using the same matrix inversion function.
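For concreteness, here is a minimal CPU-side sketch of the two variants being compared; this is plain C++ rather than the actual device code, and the function names and single-pair signatures are my own invention (the real code builds a full matrix over all variable pairs):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Full version: C(X,Y) = E(X*Y) - avg(X)*avg(Y).
// One pass accumulates both the product sum and the per-variable sums.
double cov_full(const std::vector<double>& x, const std::vector<double>& y) {
    double sumXY = 0.0, sumX = 0.0, sumY = 0.0;
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        sumXY += x[i] * y[i];  // product accumulation (kept in both versions)
        sumX  += x[i];         // per-variable sums (removed in the stripped version)
        sumY  += y[i];
    }
    return sumXY / n - (sumX / n) * (sumY / n);
}

// Stripped version: the summations and the second term are removed,
// leaving only C(X,Y) = E(X*Y).
double cov_stripped(const std::vector<double>& x, const std::vector<double>& y) {
    double sumXY = 0.0;
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i)
        sumXY += x[i] * y[i];
    return sumXY / n;
}
```

The stripped version is strictly a subset of the full version's work, which is what makes the timing result below so surprising.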

And yet, the first function – the one that performs more math – actually runs faster than the second function, which is the first function with some steps stripped away.

I can’t see any reason for this to be. I’m not accessing global memory in this function; just a few local memory arrays and some register variables. The second function compiles into fewer registers than the first function. I am confident my timing mechanism is accurate (I’m actually using the timer from cutil). I tried adjusting the size of my input to see if the speed had to do with the size of the values before going into the inversion step, but that didn’t do it either.

So, now I’m out of ideas. Any advice on what I could investigate next? I’m guessing the compiler must be doing something for me, but I don’t see why it would optimize the first implementation and leave the second slower.

Thanks for your help!

You said that you’re not accessing global mem - If you’re not writing to global mem at the end, it is possible that the compiler is optimising the entire kernel away … could that be it?

No, I do write to global memory later in another function. This is a device function my kernel calls that performs this intermediate step. I use the results of this device function, so nothing should be optimized away. I can see the timing difference if I change which of the two functions I call from my kernel.

It is hard to determine the problem from your post. How do you measure the timing? Different functions will have different overheads. Normally the performance of a simple CUDA program is bandwidth-bound, so it is not necessarily true that the function that does more work runs faster.

I use the cutil timer to measure timing. In a loop where I repeatedly call my kernel, I first call cutResetTimer and cutStartTimer. I then load data to the device (the exact same data every time), invoke my kernel, copy the results to RAM, and then call cutStopTimer and cutGetTimerValue to get the total time required.
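The measurement structure described above looks roughly like the following host-side sketch. Here std::chrono stands in for the cutil timer calls (cutResetTimer/cutStartTimer and cutStopTimer/cutGetTimerValue), and runKernelAndCopyBack is a hypothetical placeholder for the real upload/launch/copy sequence:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the real sequence: upload the input data,
// invoke the kernel, copy the results back to host RAM.
static void runKernelAndCopyBack() {
    volatile double sink = 0.0;
    for (int i = 0; i < 100000; ++i) sink = sink + i * 0.5;  // dummy work
}

// One timed iteration, mirroring the cutResetTimer/cutStartTimer ...
// cutStopTimer/cutGetTimerValue bracket; returns elapsed milliseconds.
static double timeOneIteration() {
    auto start = std::chrono::steady_clock::now();  // ~ cutResetTimer + cutStartTimer
    runKernelAndCopyBack();
    auto stop = std::chrono::steady_clock::now();   // ~ cutStopTimer
    return std::chrono::duration<double, std::milli>(stop - start).count();  // ~ cutGetTimerValue
}
```

Note that because the timer brackets the transfers as well as the kernel, the measured interval only reflects the kernel change if the kernel dominates the total time.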

If I change nothing other than which of the two covariance functions I call within my kernel (recompiling between them – I do not test both kernels at the same time), I see a change in the time reported. It takes longer when I call the device function that does not use the sums or averages. This happens every time; I have run many tests to make sure it's not a fluke.

I've encountered bandwidth being the primary bottleneck before, but in this case, if it were simply bandwidth, I would expect the times for the two versions to be basically the same. I'm also running the kernel on a relatively sizable chunk of data and performing a large number of matrix inversions (using the same inversion code with either function), so the math (or at least the shuffling through local memory) takes more time than the transfers across the bus.

I ran additional tests in which I removed the matrix inversion step and replaced it with a simple matrix copy (wondering if, perhaps, the matrix inversion ran faster or slower depending on the nature of the input data). This did not change my results: both versions were faster, of course, but the function that performs the additional mathematical steps is still a good bit faster than the version with steps removed.