How to improve my kernel execution time? memory bound; occupancy; maxrregcount; cubin; math function

To improve my kernel's execution time I need to improve occupancy, since my kernel is completely global-memory bound!
To improve multiprocessor occupancy, I need to decrease the number of registers it uses.
I'm using 8*mps blocks (mps = 4 multiprocessors on my 9600M GT) and 128 threads per block (using 256 makes no difference).

I started by trying to compile my code with --cubin, but I get an error message:
“more than one compilation phase specified”
So I can’t use decuda to see what’s going on, since it needs a .cubin file.
How do I fix this?
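In case it helps, this is roughly what I'm invoking (file names are made up); my guess is that the message means --cubin is being combined with another compilation-phase option such as -c, and the fix would be a separate invocation like the second line, but is that right?

    nvcc -c --cubin mykernel.cu                 # reproduces the error for me

    nvcc --cubin -o mykernel.cubin mykernel.cu  # presumably --cubin must be the only phase option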

Then I tried to decrease the number of registers used: instead of passing 10 arguments to my kernel, I pass one pointer to a data structure in global memory, have the first thread of each block copy the parameters from global memory to shmem, and take advantage of the shmem broadcast capability. Surprisingly, my kernel started using many more registers.
Is there any explanation for this?
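Roughly, the scheme looks like this (a simplified sketch; the struct name and fields are placeholders rather than my real parameters):

    struct KernelParams {      // hypothetical stand-in for my 10 arguments
        float *x;
        float *y;
        int    n;
    };

    __global__ void my_kernel(KernelParams *gparams)
    {
        __shared__ KernelParams p;    // one copy per block in shmem
        if (threadIdx.x == 0)
            p = *gparams;             // thread 0 copies the params global -> shmem
        __syncthreads();              // every thread then reads p via shmem broadcast
        // ... the kernel body uses p.x, p.y, p.n ...
    }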

I tried compiling with --maxrregcount 10, but the results got worse.
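For reference, I'm passing the flag straight to nvcc (file name is a placeholder):

    nvcc --maxrregcount 10 -o mykernel mykernel.cu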

Given that all my global memory accesses are coalesced, is there any advantage to using texture memory?
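If there is an advantage, I assume the change would look something like this 1D texture-reference sketch (d_x, n and the kernel body are placeholders for my real code):

    texture<float, 1, cudaReadModeElementType> tex_x;  // file-scope texture reference

    __global__ void kernel_tex(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex_x, i);   // the fetch goes through the texture cache
    }

    // host side, before the launch:
    // cudaBindTexture(0, tex_x, d_x, n * sizeof(float));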

My kernel mostly calculates offsets and accesses glmem; one of the few "payload" operations is this:
(x_glmem - y_shmem) * (x_glmem - y_shmem)
Are there any CUDA math functions or strategies for improving this operation?
For example, I tried __mul24, which is supposed to be faster, but it made no difference.
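For context, the relevant part of my kernel looks roughly like this (a simplified, self-contained sketch; the real kernel is more involved):

    __global__ void payload(const float *x, const float *y, float *out)
    {
        __shared__ float y_sh[128];                // 128 = threads per block
        int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // offset via __mul24
        y_sh[threadIdx.x] = y[i];                  // stage y in shmem
        __syncthreads();
        float d = x[i] - y_sh[threadIdx.x];        // read each operand only once
        out[i] = d * d;                            // the squared-difference payload
    }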

After some hammering, I went from 16 registers down to 13, but I need to be at 10 for 100% occupancy.
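If I have the hardware numbers right (the 9600M GT is compute capability 1.1: 8192 registers per multiprocessor, at most 768 resident threads, and per-block register allocation rounded up to a multiple of 256), the arithmetic for 128-thread blocks works out like this:

    13 regs * 128 threads = 1664 -> rounds up to 1792 -> floor(8192/1792) = 4 blocks/MP
        -> 4 * 128 = 512 threads -> 512/768 ≈ 66% occupancy

    10 regs * 128 threads = 1280 -> floor(8192/1280) = 6 blocks/MP
        -> 6 * 128 = 768 threads -> 768/768 = 100% occupancy

(16 registers also lands at 4 blocks per multiprocessor, which matches the 66% in the update below.)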

UPDATE (while writing this):
I just tried --maxrregcount 10 again, and this time my kernel did in fact get faster (when I had 16 registers and tried it, times got worse)! Occupancy jumped from 66% to 100% and execution times improved proportionally.
All my global accesses are coalesced, but I now have local memory accesses =(
The CUDA profiler's warp_serialize counter is at 0. I have lots of branching, but divergent branches are only about 0.03%.
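For anyone else tracking register and local-memory usage: compiling with verbose ptxas output reports both per kernel (the exact wording of the report varies by toolkit version):

    nvcc -Xptxas -v --maxrregcount 10 mykernel.cu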