Luxury problem: kernel is too fast :( Please help me beat the compiler


Before you read the rest of this post I would like to add a personal note, I don’t like GFLOPs much as a “metric” but there are others who desire to hear guestimates of GFLOPs being thrown around…

So for research interest I’m trying to increase the amount of work being done in the “hot” inner loop, i want to be going from O(n^2) to O(n^3).

The previous implementation was really fast, able to to do 25.16 GFLOP in 0.104 s (as guestimates go… ). What i’m basically doing is increasing the amount of work by a factor of about ~160, the problem is that my time only scales by about a factor of about 25. Which doesn’t give me credible GFLOPs numbers.

I’m thinking the compiler is much smarter and finding shortcuts. So if you would look at this code snippet and tell me if there are any shortcuts being taken here:

for(int i = 0; i < loopDepth; i++)


				  val += __sinf(i*0.123f);

#pragma unroll

				for(beam = 0; beam < loopDepth; beam++) 


					val += shared_bin[i];	

					reg_matrix[beam] += y_bin_val*shared_bin[beam]*val;




The reg_matrix is a sort of thread local array that is kept from local memory by aggressive unrolling throughout the kernel… (thus the #pragma unroll)

The “val” stuff is basically garbage that I’ve added just to try to beat the compiler. Seems I’m not… So again, the total time goes from 0.1 s to 2.5 s while I’m increasing the number of computations about ~160.

Anyways, it’s probably obvious and hopefully someone can give me a good hint.



Is that CPU code or CUDA code?


That’s GPU code. Please let me know if any questions.

Again a lot of the values are stored in registers so this might not look like your typical CUDA code with threadIdx all about the place :)

I am just assuming that “val” and “reg_matrix” are per-thread variables…

If your loopDepth is known apriori, consider finding “sinf” once and store in global memory. Then, you can access it via Texture fetch… That will remove “sinf” being called REPEATEDLY in that code… The rest all looks normal…

Yes, that is correct… And loopDepth is known at compile time.

The “val” and “__sinf” computation isn’t really needed at all. It’s just GARBAGE that i added to try and get the compiler not to skip any steps in my loop. Again for testing purposes I’m trying to add extra work to kernel to see how it copes with heavier computations.

Do you think the program performs every step in both loops ?



Consider pre-fixing “volatile” keyword to avoid compiler optimization. The compiler can easily choose NOT to do garbage calculations.

Ok, i will look into how “volatile” is used.

Notice that

reg_matrix[beam] += y_bin_val*shared_bin[beam]*val

where “reg_matrix[…]” is finally written to global memory.

Interesting… Did you overload all your multiprocessors (SMs) in the first version (before increasing the workload) ???

Speedups saturate beyond a point - after all the compute resources is finite. It is possible that you have hit saturation…

I’m sorry, what exactly do you mean by overloading?

Before increasing the workload i was doing ~240 GFLOPs on my GTX260 which is already quite good. We are however considering increasing the complexity of the problem which is why I am doing this testing…

I just meant – Were there enough blocks to make all SMs work without exposing latencies??? Sorry about that.

hmm…, Its really baffling to see 160x workload increases time only by 25x :-(

Yes i agree. :)

Therefore i was thinking the only option is that the compiler is being smart here and actually not performing all of the operations…

You may want to examine the PTX and ascertain that the compiler has not out-done your workload. (atleast at the PTX level)

It seems you can only draw two conclusions from all this:

  1. You have invented the computational equivalent of perpetual motion
  2. Your measurements are incorrect.

If you are willing to discount the first possibility then that leaves you with two possible explanations:

  1. Your operation estimate is incorrect
  2. Your timing is incorrect

The operation count requires some disassembly to verify that the code is really doing all the operations you think it should (either because of a compiler bug, optimization or a coding mistake), but the timing should be trivial to verify.

adidday, I couldn’t agree more.

About the timing, this was my first suspiscion so I’ve triple checked with the visual profiler. That leaves us with the operation estimate:

I’ve checked this with an R&D scientist here at the company aswell as made sure that my algorithm is running correctly. I can disclose that I’m doing Space-Time Adaptive Processing on a very large dataset. My first implementation was doing somewhere about 80 GFLOPs, which can be compared with what DARPA did, they managed to get 110 GFLOPs out of their slightly older card ( G8800, not sure). After I did some trixie register optimization i could fit more data on-chip and decrease the amount of syncing needed. This brought my computational speed down from about 0.3 s to 0.1 s…

If there is someone else out there who has done the STAP algorithm i would be interested to hear about your results and how you estimated your FLOPs…

I don’t like FLOPs :(

But then again maybe each application needs to be disassembled as avidday suggested…

hmm maybe i can force the equivalent CPU implementation to do the same number of operations (trying to make sure the optimizer doesn’t do anything naughty) and try to compare the speedup increase/decrease instead…

Benchmarking against another implementation on other fixed hardware feels much more relevant in a case like this…

Avidday was right. I checked and rechecked until i realized i had been confused by my thread blocks splitting up the matrices… I got the GFLOP count wrong by about a factor of 10…

This gives me reasonable results and brings my GFLOPs count up to ~320. Personally I don’t care much for this stuff but it makes my supervisor happy :)

So sorry for taking up anyones time because of my faulty mistake. All of the thread blocks can sometimes make things less intuitive.

Again, sorry about that…

Shame, I was kinda hoping you had discovered computational perpetual motion…