I tried to reply, but the forum marked my post as spam.

I don’t know why.

Thank you for your answer.

I’ll try to explain my code.

Unfortunately I can’t post it, but I’ll try to make it clear with some sample code.

I have an array A of size 3000x3000.

A is stored as a one-dimensional array, but below I show it in two dimensions (for convenience).

Let A be:

A = [

00 01 02 03 04…

05 06 07 08 09…

10 11 12 13 14…

15 16 17 18 19…

20 21 22 23 24

…

25 26 27 28 29…

]

I have an array B with the same dimensions.

B stores the sum of the n-neighboring elements of A.

For n=1, B stores the following:

B[2,2] =

A[1,1] + A[1,2] + A[1,3] +

A[2,1] + A[2,2] + A[2,3] +

A[3,1] + A[3,2] + A[3,3]

For the border elements of A I pad with zero values,

but that doesn’t matter here.
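A minimal CPU reference sketch in C++ of the sums described above, not the actual kernel. It assumes that "n-neighboring" means the (2n+1) x (2n+1) window centred on each element (which matches the 3x3 case shown for n=1), that A is stored row-major in one dimension, and that zero padding is equivalent to skipping out-of-range cells. The name `neighbor_sum` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// CPU reference (assumption: window is (2n+1) x (2n+1), row-major storage).
// B[r,c] = sum of A over the window centred on (r,c); cells outside the
// array contribute 0, which has the same effect as zero padding.
std::vector<long long> neighbor_sum(const std::vector<int>& A,
                                    int rows, int cols, int n) {
    std::vector<long long> B(static_cast<std::size_t>(rows) * cols, 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            long long sum = 0;
            for (int dr = -n; dr <= n; ++dr) {
                for (int dc = -n; dc <= n; ++dc) {
                    const int rr = r + dr;
                    const int cc = c + dc;
                    // Skip neighbours that fall outside the array.
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                        sum += A[static_cast<std::size_t>(rr) * cols + cc];
                    }
                }
            }
            B[static_cast<std::size_t>(r) * cols + c] = sum;
        }
    }
    return B;
}
```

A reference like this can be used to check that both GPU implementations produce the same B.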

I have two implementations:

The first implementation has only one kernel, which performs the operations above.

Each thread calculates one of the sums above:

kernel threads:

thread0: sum for A[0,0]

thread1: sum for A[0,1]

thread2: sum for A[0,2]

…

thread3: sum for A[1,0]

thread4: sum for A[1,1]

…

etc
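The thread-to-element mapping in the listing above can be sketched as follows. This is a plain C++ illustration under the assumption that threads are numbered linearly and walk A in row-major order, one element per thread; `Cell` and `cell_for_thread` are hypothetical names, and the real kernel may derive its index differently.

```cpp
// Hypothetical helper: which element of A a linear thread index handles
// in the first implementation (assumption: row-major, one thread per element).
struct Cell {
    int row;
    int col;
};

Cell cell_for_thread(int thread, int cols) {
    // Consecutive threads handle consecutive columns of the same row.
    return Cell{thread / cols, thread % cols};
}
```

With this mapping, neighbouring threads read overlapping windows of A, since adjacent output elements share most of their neighbours.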

The second implementation has two kernels.

The first kernel calculates the offset of each thread so that each element is n elements away from the others.

The second kernel computes the sums above.

kernel2 threads:

thread0: sum for A[0,0]

thread1: sum for A[0,3]

…

thread3: sum for A[3,0]

thread4: sum for A[3,3]

…

etc

Next, kernel1 changes the offset so that the kernel2 threads compute the following sums:

thread0: sum for A[0,1]

thread1: sum for A[0,4]

…

thread3: sum for A[3,1]

thread4: sum for A[3,4]

…

etc
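The strided mapping of the second implementation can be sketched like this in plain C++. It is only an illustration: the stride is taken as a parameter (it is 3 in the listings above), the offset of the current pass is passed in as (`row_offset`, `col_offset`), and the names `StridedCell` and `strided_cell_for_thread` are hypothetical. How the real kernel1 computes its offsets is not shown here.

```cpp
// Hypothetical helper: which element of A a thread handles in one pass of
// the second implementation (assumption: threads are laid out row-major over
// a grid of cells that are `stride` elements apart in both directions).
struct StridedCell {
    int row;
    int col;
};

StridedCell strided_cell_for_thread(int thread, int cols, int stride,
                                    int row_offset, int col_offset) {
    // Number of strided cells per row of A, rounded up.
    const int cells_per_row = (cols + stride - 1) / stride;
    const int grid_row = thread / cells_per_row;
    const int grid_col = thread % cells_per_row;
    // The caller must still check that the result lies inside A.
    return StridedCell{grid_row * stride + row_offset,
                       grid_col * stride + col_offset};
}
```

For example, with 3000 columns and stride 3, offset (0,0) gives A[0,0], A[0,3], ... and offset (0,1) gives A[0,1], A[0,4], ..., so within one pass no two threads work on adjacent elements.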

Now, I get the following throughput results on compute capability 5.2 for A = 3000x3000 and n=3.

kernel1: 40M elements per second

kernel2: 70M elements per second

The timing covers only the execution of each kernel.

All kernels perform the same total number of iterations and produce the same result.

In my real code I perform some other operations, but I scan the same elements as in the example code above.

I also made an array C equal to A and summed each A element with its neighboring B elements, like this:

B[2,2] =

B[1,1] + B[1,2] + B[1,3] +

B[2,1] + A[2,2] + B[2,3] +

B[3,1] + B[3,2] + B[3,3]

I get the same results.

It’s as if I have a conflict in global memory (which contradicts the theory),

or the first kernel, which broadcasts element values when calculating e.g. A[0,] and A[0,1], is slower than the second implementation with the offset.