Optimizing calculation speed (yet another blocks, threads, allocation question)

I have a problem that requires solving equations over 1,000,000 elements. Each thread does exactly the same thing. I have an old GTX Titan with 14 SMX units. My thinking on how to split up this problem is simple:

(1 element = 1 thread)
1,000,000 elements / 14 SMX units ≈ 71,429 elements/SMX (split the work evenly among all SMX units)
71,429 elements/SMX / (512 elements/block) ≈ 140 blocks/SMX (split the work within each SMX unit - this is where I have some confusion)
Total # of blocks = 140 * 14 = 1,960 blocks

So in my launch config, I'd do something like <<<1960,512>>>.
The total number of threads here would be 1,003,520, wasting only about 0.35% of the threads.
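For concreteness, here is a minimal sketch of that launch (the kernel name and per-element work are just placeholders), with the extra ~3,520 threads masked off by a bounds check:

```
// Minimal sketch of the launch described above (names are placeholders).
__global__ void solveKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                    // masks off the ~3,520 extra threads
        data[i] = 2.0f * data[i];   // placeholder for the real per-element solve
    }
}

// Host side:
//   const int n = 1000000;
//   solveKernel<<<1960, 512>>>(d_data, n);
// Note the usual formula (n + 512 - 1) / 512 = 1954 blocks, slightly fewer
// than the 1960 computed from the per-SMX split.
```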

Now, for some of my confusion:

There are 192 single-precision (FP) cores and 64 double-precision (DP) cores per SMX unit. I'm not even sure how this factors into parallel optimization, or if it even does. Should I be looking for some multiple of 192 for FP work or 64 for DP work somewhere?

The SMX unit can have at most 64 warps (2048 threads) or 16 blocks allocated at a time. Does CUDA do this for you automatically? (In this case each SMX would run 4 blocks of 512 threads each at once, for a total of 2048 threads, maxing out its capability.) And if CUDA does do this automatically, how does what gets allocated relate to what is actually executing? I believe I read somewhere that each SMX can only compute 2 blocks simultaneously, in which case I may be better off with a <<<980,1024>>> launch configuration.
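One way to check some of these limits at runtime is to query the device properties; a minimal sketch (device 0 assumed to be the GTX Titan):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed to be the GTX Titan

    printf("SM count:              %d\n", prop.multiProcessorCount);         // 14 here
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor); // 2048 on sm_35
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);          // 1024
    printf("Warp size:             %d\n", prop.warpSize);                    // 32
    return 0;
}
```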

When I change the block size from 512 to 1024 or 256 I hardly see any performance change (below 256 I start to see a decrease). So I'm still just trying to figure this out and really master it.

I think CUDA 6.5 includes occupancy calculators, right?
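If so, a minimal sketch of how that runtime occupancy API could be used (assuming cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxActiveBlocksPerMultiprocessor are available in your toolkit; check the docs for the exact signatures) might look like this:

```
#include <cstdio>
#include <cuda_runtime.h>

// Same placeholder kernel as above.
__global__ void solveKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

int main()
{
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime which block size maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, solveKernel, 0, 0);

    // How many blocks of that size can be resident on one SMX?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, solveKernel, blockSize, 0);

    printf("Suggested block size: %d, resident blocks/SMX: %d\n", blockSize, blocksPerSM);
    return 0;
}
```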

Every kernel uses some specific amount of shared memory and registers. If there are enough resources to have 2048 threads active on the SMX unit, it will happen automatically. If the register usage is too high, you have to force some spills in order to achieve that. The launch bounds option (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds) has the effect of forcing the spills while still getting the maximum number of threads active per SMX. If you use too much shared memory, the only way to get 2048 threads is by reducing its size and using slower memory. These are competing optimizations, so if your program is more complex you need to find the balance between them.
So far for my applications, having more active threads per SMX has been beneficial because the spills are put in the fast L1 and L2 caches.
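For reference, the qualifier from that link looks roughly like this; the numbers are just an example matching the 512-thread blocks discussed above (512 * 4 = 2048 resident threads), not a recommendation:

```
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the
// compiler to cap register usage so that at least minBlocksPerMultiprocessor
// blocks of up to maxThreadsPerBlock threads can be resident on one SMX,
// spilling registers if necessary.
__global__ void __launch_bounds__(512, 4)
solveKernelBounded(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];  // placeholder for the real work
}
```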

(I was under the impression that up to sm_35 one can have only 8 blocks active, not 16, but you will get a warning anyway when compiling.)