Effective bandwidth when using shared memory vs. global memory

Hi all,

I hope you can help me clarify this issue.
I am trying to compute the effective bandwidth using the formula from
https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/

BW_Effective = (Rb + Wb) / (t * 10^9)

My device is a Kepler K40c with a theoretical bandwidth of 288.38 GB/s, running CUDA SDK 9.0.
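
For context, this is roughly the timing pattern I follow from that blog post. It is a minimal sketch with a placeholder kernel (my_kernel) and placeholder byte counts, not my real code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: it only stands in for my real kernel so that the
// timing / bandwidth pattern below compiles and runs.
__global__ void my_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int N = 5038523;                      // number of elements (defined below)
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(N + 255) / 256, 256>>>(d_out, d_in, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed kernel time in milliseconds

    // For this placeholder kernel: one 4-byte read and one 4-byte write per element.
    // For my real kernels I use the Rb / Wb expressions given further down.
    double Rb = 4.0 * N;
    double Wb = 4.0 * N;
    double bw = (Rb + Wb) / ((ms / 1e3) * 1e9); // GB/s, same formula as above

    printf("t = %.3f ms, BW_Effective = %.2f GB/s\n", ms, bw);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The quantities that go into Rb and Wb for my real kernels are: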
N = Number of elements to calculate
A = Number of atlases
K = Size of the neighbourhood
P = Size of the patch
Blocks = Number of blocks in the grid. In my case ceil(222/10) * ceil(222/10) * ceil(112/10) = 23 * 23 * 12 = 6348 (see the quick check after this list)
Sh = Side length of the shared-memory tile loaded per block: 22, 20, or 18, depending on K
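
Here is the quick check mentioned above for the block count, assuming the grid covers a 222 x 222 x 112 domain in tiles of 10 elements per dimension (which is how I read the expression above):

```cpp
#include <cstdio>

int main()
{
    // Each dimension of the 222 x 222 x 112 domain is split into tiles of 10,
    // rounding up, which reproduces the block count used below.
    int bx = (222 + 9) / 10;                 // 23
    int by = (222 + 9) / 10;                 // 23
    int bz = (112 + 9) / 10;                 // 12
    printf("Blocks = %d\n", bx * by * bz);   // 23 * 23 * 12 = 6348
    return 0;
}
```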

For A = 18, K = 11, P = 3, Blocks = 6348, Sh = 22, and N = 5038523,
the calculation for the global-memory version reads:

CUDA-GM

Wb = 4 * N
Rb = 4 * N * A * (K^3 + 2 * K^3 * P^3)

Wb = 20154092
Rb = 2.65568E+13

we have BW_Effective = (20154092 + 2.65568E+13) / (330.6 * 10^9) = 80.30 GB/s
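
Spelled out as a quick host-side check (plain arithmetic, just plugging in the values above):

```cpp
#include <cstdio>

int main()
{
    const double N = 5038523, A = 18, K = 11, P = 3;
    const double K3 = K * K * K, P3 = P * P * P;

    double Wb = 4.0 * N;                             // one 4-byte write per element
    double Rb = 4.0 * N * A * (K3 + 2.0 * K3 * P3);  // reads per element, over A atlases

    double t  = 330.6;                               // the time value used above
    double bw = (Rb + Wb) / (t * 1e9);

    printf("Wb = %.0f, Rb = %.5e, BW_Effective = %.2f GB/s\n", Wb, Rb, bw);
    return 0;
}
```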

CUDA-SM
We have the same write bytes, but the reads are reduced because each block loads the needed elements into shared memory only once and then reads them from shared memory for the calculations.
Wb = 4 * N
Rb = 4 * N * A * K^3 + 4 * A * Blocks * Sh^3 + 4 * N * P^3

Wb = 20154092
Rb = 4.83396E+11

we have BW_Effective = (20154092 + 4.83396E+11) / (139.58 * 10^9) = 3.46 GB/s
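
And the same kind of check for the shared-memory version, evaluating the Rb expression above term by term:

```cpp
#include <cstdio>

int main()
{
    const double N = 5038523, A = 18, K = 11, P = 3;
    const double Blocks = 6348, Sh = 22;
    const double K3 = K * K * K, P3 = P * P * P, Sh3 = Sh * Sh * Sh;

    double Wb = 4.0 * N;                     // writes are unchanged
    double Rb = 4.0 * N * A * K3             // neighbourhood reads
              + 4.0 * A * Blocks * Sh3       // one-time tile loads into shared memory
              + 4.0 * N * P3;                // patch reads

    double t  = 139.58;                      // the time value used above
    double bw = (Rb + Wb) / (t * 1e9);

    printf("Wb = %.0f, Rb = %.5e, BW_Effective = %.2f GB/s\n", Wb, Rb, bw);
    return 0;
}
```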

I am confused because the shared-memory version runs faster, yet its effective bandwidth is smaller. This is not what I expected, although it makes sense from the formulas, since we reduce the reads from global memory. However, does this mean that even with shared memory I am not using the full capability of my device, and that the kernel could be even faster? If so, how?

Let me know if you need more information; I will be more than happy to provide it.

Thanks in advance for any suggestions you might have.