Hi all,

I hope you can help me clarify this issue.

I am trying to compute the effective bandwidth (in GB/s) from the bytes read (Rb), the bytes written (Wb) and the elapsed time t in seconds, using the formula

Bw_Effective = (Rb + Wb) / (t * 10^9)

My device is a Kepler K40c with a theoretical bandwidth of 288.38 GB/s. I am using CUDA SDK 9.0.

N = Number of elements to calculate

A = Number of atlases

K = Size of neighbourhood

P = Size of the patch

Blocks = Number of blocks in the grid. In my case (222)/10 * (222)/10 * (112)/10, with each factor rounded up

Sh = Size of the shared memory needed for each block: 22, 20 or 18, depending on K
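As a sanity check on the grid size, here is a minimal Python sketch. It assumes a 222 x 222 x 112 volume covered by 10 x 10 x 10 blocks, with each of the three quotients rounded up to a whole number of blocks. That rounding rule is an assumption, but it reproduces the Blocks value of 6348 used below. The helper name `grid_blocks` is only for illustration.

```python
import math

def grid_blocks(dims, tile):
    """Blocks needed to cover `dims` with cubic tiles of side `tile`.

    Each dimension is rounded up (assumed rounding rule).
    """
    return math.prod(math.ceil(d / tile) for d in dims)

# 222/10 -> 23, 222/10 -> 23, 112/10 -> 12
blocks = grid_blocks((222, 222, 112), 10)
print(blocks)
```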

For A = 18, K = 11, P = 3, Blocks = 6348, Sh = 22 and N = 5038523,

the calculation for the global memory version reads

**CUDA-GM**

Wb = 4 * N

Rb = 4 * N * A * (K^3 + 2 * K^3 * P^3)

*Wb* = 20154092

*Rb* = 2.65568E+13

We have Bw_Effective = (20154092 + 2.65568E+13) / (330.6 * 10^9) = **80.33 GB/s**
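The global memory arithmetic can be reproduced with a short Python sketch. It assumes that N enters the byte counts linearly (4 * N = 20154092 matches the quoted write figure) and that the time of 330.6 is in seconds, as the 10^9 in the formula implies. The function name `effective_bandwidth_gbs` is hypothetical.

```python
def effective_bandwidth_gbs(read_bytes, write_bytes, seconds):
    """Effective bandwidth in GB/s: (Rb + Wb) / (t * 10^9)."""
    return (read_bytes + write_bytes) / (seconds * 1e9)

# Parameters quoted in the post
N, A, K, P = 5038523, 18, 11, 3
t_gm = 330.6  # assumed to be seconds

wb = 4 * N                                  # bytes written
rb = 4 * N * A * (K**3 + 2 * K**3 * P**3)   # bytes read from global memory

bw_gm = effective_bandwidth_gbs(rb, wb, t_gm)
print(f"Wb = {wb}, Rb = {rb:.5e}, Bw = {bw_gm:.2f} GB/s")  # roughly 80.3 GB/s
```

The write term is about six orders of magnitude smaller than the read term, so the result is determined almost entirely by Rb.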

**CUDA-SM**

The bytes written are the same, but the reads are reduced because each block loads the elements it needs only once and then reads them from shared memory for the calculations.

Wb = 4 * N

Rb = 4 * N * A * K^3 + 4 * A * Blocks * Sh^3 + 4 * N * P^3

*Wb* = 20154092

*Rb* = 4.83396E+11

We have Bw_Effective = (20154092 + 4.83396E+11) / (139.58 * 10^9) = **3.46 GB/s**
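The same check for the shared memory version, as a Python sketch that evaluates the three read terms separately. It assumes N enters linearly and that 139.58 is in seconds. The names `term1`, `term2` and `term3` are only labels for the three summands of the read formula.

```python
# Parameters quoted in the post
N, A, K, P = 5038523, 18, 11, 3
blocks, sh = 6348, 22
t_sm = 139.58  # assumed to be seconds

wb = 4 * N                        # bytes written
term1 = 4 * N * A * K**3          # first summand of the read formula
term2 = 4 * A * blocks * sh**3    # second summand (per-block loads of Sh^3 elements)
term3 = 4 * N * P**3              # third summand

rb_quoted = term1 + term3         # matches the 4.83396E+11 figure above
rb_full = term1 + term2 + term3   # all three summands

bw_quoted = (rb_quoted + wb) / (t_sm * 1e9)
bw_full = (rb_full + wb) / (t_sm * 1e9)
print(f"{rb_quoted:.5e} -> {bw_quoted:.2f} GB/s")
print(f"{rb_full:.5e} -> {bw_full:.2f} GB/s")
```

Evaluated this way, the 4.83396E+11 figure corresponds to the first and third terms only. Adding the 4 * A * Blocks * Sh^3 term contributes about 4.87E+9 more bytes and moves the result from 3.46 to roughly 3.50 GB/s, so it does not change the overall picture.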

I am confused because the run time is shorter but the effective bandwidth is lower. This is not what I had expected, although it follows from the formulas, since we reduce the reads from global memory. Does this mean that even with shared memory I am not using the full capability of my device, and that the kernel could be much faster? If so, how?
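To put the two results side by side, here is a small Python sketch that expresses each effective bandwidth as a fraction of the theoretical peak and computes the speedup from the two run times. It uses the rounded figures from this post (80.3 and 3.46 GB/s, 330.6 and 139.58 for the times), and `fraction_of_peak` is a hypothetical helper. Keep in mind that this metric only counts global memory traffic, so reads served from shared memory do not appear in it.

```python
PEAK_GBS = 288.38  # theoretical bandwidth quoted for the K40c

def fraction_of_peak(bw_gbs, peak_gbs=PEAK_GBS):
    """Effective bandwidth as a fraction of the theoretical peak."""
    return bw_gbs / peak_gbs

gm_fraction = fraction_of_peak(80.3)   # global memory version
sm_fraction = fraction_of_peak(3.46)   # shared memory version
speedup = 330.6 / 139.58               # ratio of the two run times

print(f"GM: {gm_fraction:.1%} of peak")
print(f"SM: {sm_fraction:.1%} of peak")
print(f"speedup: {speedup:.2f}x")
```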

Let me know if you need more information; I will be more than happy to provide it.

Thanks in advance for any suggestions you might have.