My experience has been limited so take these comments with a grain of salt :)
higher occupancy does not guarantee better performance
register usage is just one of the tradeoffs. For example, in my kernel, I increased register usage to load some of the global memory into the registers. This decreased occupancy but my kernel ran faster because of more GPU-friendly global memory accesses.
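A minimal sketch of the idea (plain C standing in for device code; names are made up): copy a global-memory operand into a local once, then reuse the local. On the GPU the local lands in a register, which raises register count but cuts global-memory traffic.

```c
#include <stddef.h>

/* Instead of re-reading g[i] from (global) memory on every use, copy it
 * into a local once.  The extra local costs a register but each element
 * is fetched from memory only once. */
float sum_of_squares_cached(const float *g, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = g[i];   /* one read per element...            */
        acc += v * v;     /* ...reused instead of reading again */
    }
    return acc;
}
```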
nvcc does not appear to be optimizing for common expressions so you may want to try manually pulling out all the common expressions and calculating them once explicitly. In my case, this increased register usage but gave me more performance by (probably) reducing global memory access.
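For example (hypothetical names, host-side C for illustration), "pulling out" a common expression just means hoisting it into a temporary so it is computed once:

```c
/* Before: (a + b) appears twice and the compiler may or may not merge
 * the two occurrences. */
float combined_before(float a, float b, float c, float d)
{
    return (a + b) * c + (a + b) * d;
}

/* After: the common part is computed once explicitly. */
float combined_after(float a, float b, float c, float d)
{
    float ab = a + b;      /* common expression, calculated once */
    return ab * c + ab * d;
}
```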
Spencer, can you provide an example of this issue? Note that not all optimizations show up in the PTX, because PTX is a virtual machine. Machine-optimized code is generated after PTX and stored in binary form (we don’t expose the actual machine assembly languages, because they will likely gradually change from GPU to GPU).
But it’s possible there are still compiler bugs with optimizations (and there have already been a lot of compiler performance improvements that will be in the next major CUDA release).
If you have an example, we’d like to see it!
As I’ve said before, higher occupancy does not always translate to higher performance. Having more active threads per multiprocessor can help hide memory latency if you are memory bound, and can also better fill the instruction pipeline so there are no bubbles. That said, if you have to make your code bend over backward to reduce register count, it can cause your instruction count to increase, which may be an overall loss if you are not memory bound.
Here’s a rule to live by:
The NUMBER ONE optimization you can make is to ensure your memory accesses are coalesced. This can mean the difference between hundreds of cycles of latency PER THREAD and hundreds of cycles of latency PER WARP. In other words, it can mean ORDERS OF MAGNITUDE!
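A rough way to picture it (plain C standing in for the per-thread index math; the 32-thread warp is the usual assumption): in the coalesced pattern, consecutive threads of a warp touch consecutive addresses, so the hardware can service the whole warp in one (or a few) transactions; with a stride, the accesses scatter and each thread can pay its own full latency.

```c
#include <stddef.h>

#define WARP_SIZE 32

/* Coalesced: thread t of the warp reads element base + t, so the warp
 * touches one contiguous segment. */
size_t coalesced_index(size_t base, unsigned t)
{
    return base + t;
}

/* Strided: thread t reads base + t * stride; for stride > 1 the warp's
 * accesses spread over a much larger region. */
size_t strided_index(size_t base, unsigned t, size_t stride)
{
    return base + (size_t)t * stride;
}
```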
I’ve seen nvcc do common expression optimizations in several cases. Actually, in those cases register use was lower when nvcc compiled the code as opposed to me storing a common expression result in a temporary variable. It appears that nvcc has an easier time optimizing common expressions when the common part is written in the same way (character-wise) and is the first part of the expression. For example, (1) seems to be preferred over (2):
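A hypothetical illustration of the two forms (the original snippet is not shown, so all identifiers here are made up):

```c
/* (1) the common part a*b is spelled identically and comes first in
 * each expression: */
float form1(float a, float b, float c, float d)
{
    float x = a * b + c;
    float y = a * b + d;
    return x + y;
}

/* (2) the same computation, but the shared term is spelled differently
 * (b*a vs a*b) and placed last -- reportedly harder for nvcc to merge: */
float form2(float a, float b, float c, float d)
{
    float x = c + a * b;
    float y = d + b * a;
    return x + y;
}
```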
These are by no means guidelines, just something I noticed when rearranging my source code and checking the effect on the resulting assembly. Also, the code above is just a representative illustration.
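The code being described in the next post did not survive the quote; a sketch of the access pattern from the description (only src1/plane0 and the 2*x indexing come from the post, everything else is guessed):

```c
/* The sub-expression src1[2*x] + src1[2*x+1] is written out twice, and
 * the PTX showed each global read (e.g. src1[2*x]) issued twice rather
 * than once.  Reconstruction -- the real kernel is not shown. */
void average_rows(int *plane0, const int *src0, const int *src1, int x)
{
    plane0[2*x]   = (src0[2*x]   + src1[2*x] + src1[2*x+1]) / 3;
    plane0[2*x+1] = (src0[2*x+1] + src1[2*x] + src1[2*x+1]) / 3;
}
```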
Where plane0 is stored in global memory. When I looked at the PTX output, it looked like the global memory reads were not being optimized, e.g. “src1[2*x] + src1[2*x+1]” occurs twice in the code and the PTX output shows each value, e.g. src1[2*x], being read from global memory twice.
When I rewrote the code, I did this (taking into account that I am having problems with mis-aligned 32 bit reads, which you know about):
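The rewrite itself is not shown either; in the spirit of the description (one explicit read per src element, the shared sum precomputed; the i16p_* naming is modeled on the post, the logic is guessed):

```c
/* Each global value is read into a local exactly once, and the common
 * sub-expression is calculated explicitly (cf. the "i16p_*" temporaries
 * mentioned in the post). */
void average_rows_rewritten(int *plane0, const int *src0,
                            const int *src1, int x)
{
    /* one explicit read per global value */
    int s0a = src0[2*x], s0b = src0[2*x+1];
    int s1a = src1[2*x], s1b = src1[2*x+1];

    /* common intermediate computed once */
    int i16p_sum1 = s1a + s1b;

    plane0[2*x]   = (s0a + i16p_sum1) / 3;
    plane0[2*x+1] = (s0b + i16p_sum1) / 3;
}
```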
This code ran much faster, though it looks a lot uglier. Some of the gain may come from how I re-ordered the global memory accesses, but some is likely due to calculating the intermediate expressions explicitly and doing only one global read for each access to srcX. [Note: there were other intermediate versions where the kernel ran slower when I did not make the “i16p_*” code changes.]
You will have to ask your compiler guys whether nvcc can safely do what I did by hand.