Performance improvement by increasing hardware occupancy


I am looking for an application that gets a large speedup from reducing the register count and thereby increasing hardware occupancy. I have tried a few from the SDK, but I only get around a 5% speedup compared with the unoptimized, higher-register-count version. I think the reason is that the applications I tried are not bandwidth bound (i.e., they don't issue many memory transactions). I want to measure how much performance this kind of application can gain from higher occupancy, and I am hoping people on the forum can point me to a few examples.

Thank you,


In my experience, increasing occupancy does not necessarily increase performance. For some benchmarks, see: …mp;hl=occupancy (all of the kernels benchmarked in that thread are 100% memory-bandwidth bound).

If you really need an example where occupancy optimization makes a big difference, you will probably have to find a kernel with lots of complex math functions that push the register count so high that occupancy is extremely low, and then optimize it (if possible) to a decent occupancy. The rule of thumb on the forums (and in the new CUDA Best Practices Guide) is that pushing occupancy above 50% gains little.
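
To make the register/occupancy trade-off concrete, here is a rough host-side sketch of the arithmetic. It is not the official occupancy calculator: the `SmLimits` struct, the `estimate_occupancy` helper, and the per-multiprocessor limits used in the note below are hypothetical values picked for illustration, and shared memory and register allocation granularity are ignored. Check the limits for your own GPU before trusting the numbers.

```cpp
#include <algorithm>
#include <initializer_list>

// Hypothetical per-multiprocessor limits (assumed for illustration only;
// real values depend on the GPU's compute capability).
struct SmLimits {
    int registers;    // 32-bit registers available per multiprocessor
    int max_threads;  // maximum resident threads per multiprocessor
    int max_blocks;   // maximum resident blocks per multiprocessor
};

// Estimate occupancy (0.0 to 1.0) when only the register file, the thread
// limit, and the block limit constrain how many blocks can be resident.
// Assumes regs_per_thread > 0 and threads_per_block > 0.
double estimate_occupancy(const SmLimits& sm, int regs_per_thread,
                          int threads_per_block) {
    const int regs_per_block = regs_per_thread * threads_per_block;
    const int by_regs = sm.registers / regs_per_block;        // register-limited
    const int by_threads = sm.max_threads / threads_per_block; // thread-limited
    const int blocks = std::min({by_regs, by_threads, sm.max_blocks});
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With the assumed limits `SmLimits{16384, 1024, 8}` and 256 threads per block, a kernel using 64 registers per thread comes out at 25% occupancy, 32 registers at 50%, and 16 registers at 100%. So a kernel of the kind described above would be one where cutting registers (for example with nvcc's `-maxrregcount` option or a `__launch_bounds__` qualifier) moves it from the low end of that range up toward 50%.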