About conflicts in Global memory

The code below is an example.

I have an array A in global memory,
e.g. A[6] = [0, 1, 2, 3, 4, 5].
Suppose I want to add each element to the previous one and accumulate the result:
e.g. sum += A[i] + A[i-1]

In the first implementation I have:
thread0: A[1] + A[0]
thread1: A[2] + A[1]
thread2: A[3] + A[2]
thread3: A[4] + A[3]
thread4: A[5] + A[4]
In this scenario I believe I have conflicts:
thread0 and thread1, if they read A[1] at the same time
thread1 and thread2, if they read A[2] at the same time, etc.

In the second implementation I changed the code, and now I have two kernels that run by offset (odd/even) like this:
thread0: A[1] + A[0]
thread2: A[3] + A[2]
thread4: A[5] + A[4]
thread1: A[2] + A[1]
thread3: A[4] + A[3]

The second implementation runs much faster than the first.
In theory, reads from the same address do not create conflicts, even if many threads access it (compute capability 5.2).
Why does the second implementation run faster?

The first implementation cannot work correctly (give proper results) without use of atomics, synchronization, or some other method to resolve simultaneous access. The second method simply follows a typical sweep reduction approach, and of course can be made to work correctly.

If we stipulate that the first method produces incorrect results, there is no particular reason it should run slower than the second method. If that is your observation, then there is some other aspect of your code that is giving rise to that. I cannot conjecture what that may be with no code to inspect.

Thank you for your answer.
You are correct.

There is a mistake in my sample code above.
The sum is not a single variable but an array.

So in the first implementation we have:
thread0: sum[0] = A[1] + A[0]
thread1: sum[1] = A[2] + A[1]
thread2: sum[2] = A[3] + A[2]
thread3: sum[3] = A[4] + A[3]
thread4: sum[4] = A[5] + A[4]

And in the second implementation:
thread0: sum[0] = A[1] + A[0]
thread2: sum[2] = A[3] + A[2]
thread4: sum[4] = A[5] + A[4]
thread1: sum[1] = A[2] + A[1]
thread3: sum[3] = A[4] + A[3]
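
As a rough sketch only (the real code was not posted), the two variants listed above could look like this in CUDA. The kernel names pairSumAll and pairSumStrided, the int element type, and the launch sizes are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Implementation 1 (assumed form): one launch, thread i produces sum[i].
__global__ void pairSumAll(const int *A, int *sum, int nOut)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nOut)
        sum[i] = A[i + 1] + A[i];
}

// Implementation 2 (assumed form): launched twice, with offset 0 for the
// even outputs and offset 1 for the odd outputs.
__global__ void pairSumStrided(const int *A, int *sum, int nOut, int offset)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + offset;
    if (i < nOut)
        sum[i] = A[i + 1] + A[i];
}

int main()
{
    const int n = 6, nOut = n - 1;
    int hA[n] = {0, 1, 2, 3, 4, 5};
    int hSum1[nOut], hSum2[nOut];

    int *dA, *dSum;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dSum, nOut * sizeof(int));
    cudaMemcpy(dA, hA, n * sizeof(int), cudaMemcpyHostToDevice);

    // Implementation 1: a single launch covers every output.
    pairSumAll<<<1, 32>>>(dA, dSum, nOut);
    cudaMemcpy(hSum1, dSum, nOut * sizeof(int), cudaMemcpyDeviceToHost);

    // Implementation 2: even pass, then odd pass.
    cudaMemset(dSum, 0, nOut * sizeof(int));
    pairSumStrided<<<1, 32>>>(dA, dSum, nOut, 0);
    pairSumStrided<<<1, 32>>>(dA, dSum, nOut, 1);
    cudaMemcpy(hSum2, dSum, nOut * sizeof(int), cudaMemcpyDeviceToHost);

    // Print both results side by side so they can be compared.
    for (int i = 0; i < nOut; ++i)
        printf("sum[%d]: impl1 = %d, impl2 = %d\n", i, hSum1[i], hSum2[i]);

    cudaFree(dA);
    cudaFree(dSum);
    return 0;
}
```

In both variants each sum[i] is written by exactly one thread and A is only read, so neither variant needs atomics.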

Do we have any conflict in implementation 1?
Why is implementation 2 faster than implementation 1?

The link for sweep reduction is missing.

I have fixed the link.

No, no conflict.

I wouldn’t expect it to be. You’re advancing a statement or claim without any support, and asking me or someone else to explain it.

Let me try that with you:

“Why is the world flat?”

If you asked me that question, I would say “The world is not flat. How did you come to that conclusion, which observations and measurements did you make?”

You’ve given no observations (except the very highest one) and no measurements.

I tried to reply, but my post was marked as spam by the forum.
I don’t know why.

Thank you for your answer.
I’ll try to explain my code.
Unfortunately I can’t post it, but I’ll try to make it clear with some sample code.

I have an array A of size 3000x3000.
I store A as a one-dimensional array, but below I show it in two dimensions (for convenience).
Let A be:
A = [
00 01 02 03 04…
05 06 07 08 09…
10 11 12 13 14…
15 16 17 18 19…
20 21 22 23 24…
25 26 27 28 29…
…
]
I have an array B with the same dimensions.
B stores, for each element, the sum of its n-neighboring elements.
For n=1, B stores the following:
B[2,2] =
A[1,1] + A[1,2] + A[1,3] +
A[2,1] + A[2,2] + A[2,3] +
A[3,1] + A[3,2] + A[3,3]
For the border elements of A I pad with zero values,
but this doesn’t matter here.
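
To make the operation concrete, here is a minimal sketch of such a neighborhood sum, not the actual code. It assumes float elements, row-major storage, and a (2n+1) x (2n+1) window. The kernel name neighborSum and the 16x16 block size are made up for the example, and out-of-range neighbors are skipped instead of reading a zero-padded border.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// B[r][c] = sum of A over the (2n+1) x (2n+1) window centred on (r, c).
// Neighbours outside the array contribute zero (stands in for zero padding).
__global__ void neighborSum(const float *A, float *B, int rows, int cols, int n)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r >= rows || c >= cols) return;

    float s = 0.0f;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int rr = r + dr, cc = c + dc;
            if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                s += A[rr * cols + cc];
        }
    }
    B[r * cols + c] = s;
}

int main()
{
    const int rows = 3000, cols = 3000, n = 1;
    const size_t bytes = (size_t)rows * cols * sizeof(float);

    float *dA, *dB;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemset(dA, 0, bytes);  // placeholder input: all zeros

    // One thread per output element.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    neighborSum<<<grid, block>>>(dA, dB, rows, cols, n);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

Here every thread only reads A and writes its own element of B, so neighboring threads read overlapping parts of A but never write to the same location.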

I have two implementations:
The first implementation has only one kernel, which performs the above operations.
Each thread calculates one of the above sums:
kernel threads:
thread0: sum for A[0,0]
thread1: sum for A[0,1]
thread2: sum for A[0,2]

thread3: sum for A[1,0]
thread4: sum for A[1,1]


The second implementation has two kernels.
The first kernel calculates the offset of each thread,
so that the elements processed by different threads are n elements apart.
The second kernel computes the above sums.
kernel2 threads:
thread0: sum for A[0,0]
thread1: sum for A[0,3]

thread3: sum for A[3,0]
thread4: sum for A[3,3]

Next, kernel1 shifts the offset, so that the kernel2 threads compute the following sums:
thread0: sum for A[0,1]
thread1: sum for A[0,4]

thread3: sum for A[3,1]
thread4: sum for A[3,4]
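
One possible reading of this offset scheme, sketched under several assumptions: the offsets are passed from the host as kernel arguments (the post computes them in a separate kernel, which is not shown), the spacing between threads is 3 as in the listing above, elements are float, and the name neighborSumStrided is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One pass of the offset scheme: thread (tx, ty) handles the element at
// row ty * stride + offR, column tx * stride + offC.
__global__ void neighborSumStrided(const float *A, float *B, int rows, int cols,
                                   int n, int stride, int offR, int offC)
{
    int c = (blockIdx.x * blockDim.x + threadIdx.x) * stride + offC;
    int r = (blockIdx.y * blockDim.y + threadIdx.y) * stride + offR;
    if (r >= rows || c >= cols) return;

    // Same window sum as the single-kernel version; out-of-range
    // neighbours contribute zero.
    float s = 0.0f;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int rr = r + dr, cc = c + dc;
            if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                s += A[rr * cols + cc];
        }
    }
    B[r * cols + c] = s;
}

int main()
{
    const int rows = 3000, cols = 3000, n = 1, stride = 3;
    const size_t bytes = (size_t)rows * cols * sizeof(float);

    float *dA, *dB;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemset(dA, 0, bytes);  // placeholder input: all zeros

    // Each pass covers every stride-th element in both dimensions.
    dim3 block(16, 16);
    int perRow = (cols + stride - 1) / stride;
    int perCol = (rows + stride - 1) / stride;
    dim3 grid((perRow + block.x - 1) / block.x, (perCol + block.y - 1) / block.y);

    // stride * stride passes, one per (offR, offC) pair, cover all elements.
    for (int offR = 0; offR < stride; ++offR)
        for (int offC = 0; offC < stride; ++offC)
            neighborSumStrided<<<grid, block>>>(dA, dB, rows, cols,
                                                n, stride, offR, offC);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

With this layout, adjacent threads in a pass touch addresses that are stride elements apart, while in the single-kernel version they touch adjacent addresses. That difference in access pattern is worth keeping in mind when comparing the two.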


Now I have the following throughput results on compute capability 5.2 for A = 3000x3000 and n=3:
implementation 1: 40M elements per second
implementation 2: 70M elements per second
The timing covers only the kernel execution.
Both implementations perform the same total number of iterations and produce the same result.
In my real code I perform some other operations, but I scan the same elements as in the example code above.
I also made an array C equal to A and summed each A element with its neighboring B elements, like this:
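
For reference, kernel-only timing of this kind is commonly done with CUDA events. The following is a generic sketch, not the measurement code used for the numbers above. copyKernel is a placeholder standing in for the kernel under test, and the repeat count of 10 is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel being measured.
__global__ void copyKernel(const float *in, float *out, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) out[i] = in[i];
}

int main()
{
    const size_t count = 3000ull * 3000ull;
    float *in, *out;
    cudaMalloc(&in, count * sizeof(float));
    cudaMalloc(&out, count * sizeof(float));
    cudaMemset(in, 0, count * sizeof(float));

    const int threads = 256;
    const int blocks = (int)((count + threads - 1) / threads);

    // Warm-up launch so one-time startup cost is not included in the timing.
    copyKernel<<<blocks, threads>>>(in, out, count);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time several launches back to back and average.
    const int repeats = 10;
    cudaEventRecord(start);
    for (int k = 0; k < repeats; ++k)
        copyKernel<<<blocks, threads>>>(in, out, count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    double elemsPerSec = (double)count * repeats / (ms * 1e-3);
    printf("%.3f ms total, %.1f M elements per second\n", ms, elemsPerSec / 1e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

When an implementation consists of several launches, recording the start event before the first launch and the stop event after the last one keeps the comparison between implementations like for like.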
B[2,2] =
B[1,1] + B[1,2] + B[1,3] +
B[2,1] + A[2,2] + B[2,3] +
B[3,1] + B[3,2] + B[3,3]
I get the same results.

It’s as if I have a conflict in global memory (which is contrary to the theory),
or the first kernel, where values are broadcast when calculating e.g. A[0,0] and A[0,1], is slower than the second implementation with the offset.