I tried to reply, but the forum marked my post as spam.
I don’t know why.
Thank you for your answer.
I’ll try to explain my code.
Unfortunately I can’t post it, but I’ll try to make it clear with some sample code.
I have an array A of size 3000x3000.
I store A as a one-dimensional array, but below I show it in two dimensions (for convenience).
Let A be:
A = [
00 01 02 03 04…
05 06 07 08 09…
10 11 12 13 14…
15 16 17 18 19…
20 21 22 23 24…
25 26 27 28 29…
…
]
I have an array B with the same dimensions.
B stores the sum of the n-neighboring elements.
For example, if n=1, B stores the following:
B[2,2] =
A[1,1] + A[1,2] + A[1,3] +
A[2,1] + A[2,2] + A[2,3] +
A[3,1] + A[3,2] + A[3,3]
For the border elements of A I pad with zero values,
but this doesn’t matter here.
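The sum described above could be sketched roughly as follows. This is only an illustration, not the real code: it assumes `float` elements, row-major storage in a 1-D array, and a (2n+1) x (2n+1) window, and it treats out-of-range neighbors as zero instead of physically padding the array. The name `neighbor_sum` and its parameters are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical reference function: sum of the (2n+1) x (2n+1) window
// centered on (row, col). Assumes row-major 1-D storage of float values.
// Neighbors that fall outside the array contribute zero, which has the
// same effect as zero padding.
__host__ __device__ inline float neighbor_sum(const float* a, int width, int height,
                                              int row, int col, int n)
{
    float sum = 0.0f;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int r = row + dr;
            int c = col + dc;
            if (r >= 0 && r < height && c >= 0 && c < width) {
                sum += a[r * width + c];
            }
        }
    }
    return sum;
}
```

Because the function is marked `__host__ __device__`, the same code can be called on the CPU to check the output of a kernel.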
I have two implementations.
The first implementation has only one kernel, which performs the operation above.
Each thread calculates one of the sums above:
kernel threads:
thread0: sum for A[0,0]
thread1: sum for A[0,1]
thread2: sum for A[0,2]
…
thread3: sum for A[1,0]
thread4: sum for A[1,1]
…
etc
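A minimal sketch of such a single-kernel version, with one thread per element in row-major order. It is not the real code: the element type (`float`), the 1-D launch, the block size of 256, and all names (`thread_to_element`, `sum_all_elements`, `launch_sum_all_elements`) are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: map a linear thread id to (row, col) in row-major order.
__host__ __device__ inline void thread_to_element(int tid, int width, int* row, int* col)
{
    *row = tid / width;
    *col = tid % width;
}

// Sketch of the single-kernel version: one thread per element of B.
__global__ void sum_all_elements(const float* a, float* b, int width, int height, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= width * height) return;  // guard the last, partially filled block

    int row, col;
    thread_to_element(tid, width, &row, &col);

    float sum = 0.0f;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int r = row + dr;
            int c = col + dc;
            // Out-of-range neighbors count as zero (same effect as zero padding).
            if (r >= 0 && r < height && c >= 0 && c < width) {
                sum += a[r * width + c];
            }
        }
    }
    b[tid] = sum;
}

// Hypothetical launch helper; d_a and d_b are device pointers to width*height floats.
void launch_sum_all_elements(const float* d_a, float* d_b, int width, int height, int n)
{
    const int threads_per_block = 256;
    const int blocks = (width * height + threads_per_block - 1) / threads_per_block;
    sum_all_elements<<<blocks, threads_per_block>>>(d_a, d_b, width, height, n);
}
```

With this mapping, consecutive threads read consecutive addresses of A in every iteration of the inner loop.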
The second implementation has two kernels.
The first kernel calculates the offset of each thread
so that the elements processed by the threads are n elements apart from each other.
The second kernel computes the sums above.
kernel2 threads:
thread0: sum for A[0,0]
thread1: sum for A[0,3]
…
thread3: sum for A[3,0]
thread4: sum for A[3,3]
…
etc
Next, kernel1 updates the offset so that the kernel2 threads compute the following sums:
thread0: sum for A[0,1]
thread1: sum for A[0,4]
…
thread3: sum for A[3,1]
thread4: sum for A[3,4]
…
etc
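The two-kernel scheme listed above might look roughly like the sketch below. This is a guess at the structure, not the real code: it assumes a spacing `stride` between the elements handled in one pass (3 in the listing), `float` elements, row-major storage, and a host loop that repeats both kernels for every (off_row, off_col) pair. All names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical mapping: linear index of the element handled by thread `tid`
// in the pass with offset (off_row, off_col), or -1 if it is out of range.
__host__ __device__ inline int strided_element(int tid, int stride, int off_row, int off_col,
                                               int width, int height)
{
    int per_row = (width + stride - 1) / stride;  // threads per row of A
    int row = (tid / per_row) * stride + off_row;
    int col = (tid % per_row) * stride + off_col;
    if (row >= height || col >= width) return -1;
    return row * width + col;
}

// "kernel1": store the element index of every thread for the current pass.
__global__ void compute_offsets(int* offsets, int num_threads, int stride,
                                int off_row, int off_col, int width, int height)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_threads) {
        offsets[tid] = strided_element(tid, stride, off_row, off_col, width, height);
    }
}

// "kernel2": each thread sums the window around the element chosen by kernel1.
__global__ void sum_at_offsets(const float* a, float* b, const int* offsets,
                               int num_threads, int width, int height, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;
    int idx = offsets[tid];
    if (idx < 0) return;  // this thread has no element in the current pass
    int row = idx / width;
    int col = idx % width;
    float sum = 0.0f;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int r = row + dr;
            int c = col + dc;
            if (r >= 0 && r < height && c >= 0 && c < width) sum += a[r * width + c];
        }
    }
    b[idx] = sum;
}

// Hypothetical host loop: stride * stride passes cover every element once.
// d_offsets must hold ceil(width/stride) * ceil(height/stride) ints.
void run_strided(const float* d_a, float* d_b, int* d_offsets,
                 int width, int height, int n, int stride)
{
    int num_threads = ((width + stride - 1) / stride) * ((height + stride - 1) / stride);
    const int threads_per_block = 256;
    int blocks = (num_threads + threads_per_block - 1) / threads_per_block;
    for (int off_row = 0; off_row < stride; ++off_row) {
        for (int off_col = 0; off_col < stride; ++off_col) {
            compute_offsets<<<blocks, threads_per_block>>>(d_offsets, num_threads, stride,
                                                           off_row, off_col, width, height);
            sum_at_offsets<<<blocks, threads_per_block>>>(d_a, d_b, d_offsets, num_threads,
                                                          width, height, n);
        }
    }
}
```

Both kernels are launched on the default stream, so kernel2 always sees the offsets written by the preceding kernel1 launch.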
Now, I get the following throughput results on compute capability 5.2 for A = 3000x3000 and n=3:
first implementation: 40M elements per second
second implementation: 70M elements per second
The timing covers only the kernel execution.
All kernels perform the same total number of iterations and produce the same result.
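One common way to time only the kernel execution is with CUDA events. The sketch below is an assumption about how such a measurement could be done, not the actual measurement code; `time_kernel_ms` and `elements_per_second` are hypothetical helper names.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: time whatever `launch` enqueues on the default stream.
// Returns the elapsed time in milliseconds as reported by CUDA events.
template <typename Launch>
float time_kernel_ms(Launch launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);      // marker before the kernel(s)
    launch();                    // enqueue the kernel launch(es) to be timed
    cudaEventRecord(stop);       // marker after the kernel(s)
    cudaEventSynchronize(stop);  // wait until the work and the marker are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Convert an element count and a time in milliseconds to elements per second.
inline double elements_per_second(long long elements, float elapsed_ms)
{
    return static_cast<double>(elements) / (static_cast<double>(elapsed_ms) * 1e-3);
}
```

The helper would be called with a lambda that launches the kernel, and the returned time passed to `elements_per_second` together with 3000 * 3000 as the element count. For the two-kernel version, the lambda would contain the whole loop over both kernels, so that both implementations are compared on the same total work.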
In my real code I perform some other operations, but I scan the same elements as in the example code above.
I also made an array C equal to A and computed the sums of each A element with the neighboring B elements, like this:
B[2,2] =
B[1,1] + B[1,2] + B[1,3] +
B[2,1] + A[2,2] + B[2,3] +
B[3,1] + B[3,2] + B[3,3]
I get the same results.
It looks like I have a conflict in global memory (which is contrary to the theory),
or the first kernel, which broadcasts element values when calculating e.g. A[0,] and A[0,1], is slower than the second implementation with the offset.