About conflicts in Global memory

user144648 · April 26, 2023, 7:44pm

The code below is an example.

I have an array A in global memory,
e.g. A[6] as A[0, 1, 2, 3, 4, 5]
Let us suppose we sum each element with the previous one:
e.g. sum += A[i] + A[i-1]

In the first implementation I have:
thread0: A[1] + A[0]
thread1: A[2] + A[1]
thread2: A[3] + A[2]
thread3: A[4] + A[3]
thread4: A[5] + A[4]
In this scenario I believe I have conflicts:
thread0 and thread1, if they read A[1] at the same time
thread2 and thread1, if they read A[2] at the same time, etc.

In the second implementation I changed the code and now I have two kernels that run by offset (odd/even), like this:
Kernel0:
thread0: A[1] + A[0]
thread2: A[3] + A[2]
thread4: A[5] + A[4]
Kernel1:
thread1: A[2] + A[1]
thread3: A[4] + A[3]

The second implementation runs much faster than the first.
The theory says that reads from the same address do not create conflicts, even if many threads access it (compute capability 5.2).
Why does the second implementation run faster?

Robert_Crovella · April 26, 2023, 11:26pm

The first implementation cannot work correctly (give proper results) without use of atomics, synchronization, or some other method to resolve simultaneous access. The second method simply follows a typical sweep reduction approach, and of course can be made to work correctly.

If we stipulate that the first method produces incorrect results, there is no particular reason it should run slower than the second method. If that is your observation, then there is some other aspect of your code that is giving rise to that. I cannot conjecture what that may be with no code to inspect.
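
To illustrate the point about atomics, here is a minimal sketch (not the original poster's code) in which every thread adds its pair into a single accumulator with atomicAdd. The names pairSumAtomic and runPairSumAtomic are made up for this example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread i adds A[i+1] + A[i] into one accumulator.
// atomicAdd serializes the updates so that none of them is lost.
__global__ void pairSumAtomic(const int *A, int n, int *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) {
        atomicAdd(sum, A[i + 1] + A[i]);
    }
}

// Hypothetical host helper: copies A to the device, runs the kernel,
// and returns the accumulated value. Returns 0 when there are no pairs.
int runPairSumAtomic(const int *hA, int n)
{
    if (n < 2) return 0;

    int *dA = nullptr;
    int *dSum = nullptr;
    int hSum = 0;

    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dSum, sizeof(int));
    cudaMemcpy(dA, hA, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dSum, 0, sizeof(int));

    int threads = 128;
    int blocks = (n - 1 + threads - 1) / threads;
    pairSumAtomic<<<blocks, threads>>>(dA, n, dSum);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hSum, dSum, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dSum);
    return hSum;
}
```

With a plain `*sum += ...` in place of atomicAdd, concurrent threads could overwrite each other's updates, which is the incorrectness described above.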

user144648 · April 27, 2023, 5:48am

Thank you for your answer.
You are correct.

I made a mistake in the sample code above.
The sum is not a single variable but an array, like this:
sum[5]

So we have:
implementation1:
thread0: sum[0] = A[1] + A[0]
thread1: sum[1] = A[2] + A[1]
thread2: sum[2] = A[3] + A[2]
thread3: sum[3] = A[4] + A[3]
thread4: sum[4] = A[5] + A[4]

implementation2:
Kernel0:
thread0: sum[0] = A[1] + A[0]
thread2: sum[2] = A[3] + A[2]
thread4: sum[4] = A[5] + A[4]
Kernel1:
thread1: sum[1] = A[2] + A[1]
thread3: sum[3] = A[4] + A[3]

Do we have any conflict in implementation1?
Why is implementation2 faster than implementation1?

ps
The link for sweep reduction is missing.

Robert_Crovella · April 27, 2023, 3:13pm

I have fixed the link.

No, there is no conflict in implementation1.
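
As a rough sketch of why that is, assuming the layout described in the question: each thread writes only its own sum[i], and A is only ever read. The kernel and helper names below are invented for illustration, the even/odd variant maps thread t to index 2*t + offset rather than using the thread ID directly, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Implementation 1: a single launch, thread i writes sum[i].
__global__ void pairSumAll(const int *A, int *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) sum[i] = A[i + 1] + A[i];
}

// Implementation 2: thread t handles index 2*t + offset
// (offset 0 covers the even indices, offset 1 the odd ones).
__global__ void pairSumStrided(const int *A, int *sum, int n, int offset)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + offset;
    if (i < n - 1) sum[i] = A[i + 1] + A[i];
}

// Hypothetical host helper; requires n >= 2, and hSum must hold n - 1 ints.
// split == false runs implementation 1, split == true runs implementation 2.
void runPairSums(const int *hA, int *hSum, int n, bool split)
{
    int *dA = nullptr;
    int *dSum = nullptr;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dSum, (n - 1) * sizeof(int));
    cudaMemcpy(dA, hA, n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n - 1 + threads - 1) / threads;  // enough threads for either variant
    if (split) {
        pairSumStrided<<<blocks, threads>>>(dA, dSum, n, 0);
        pairSumStrided<<<blocks, threads>>>(dA, dSum, n, 1);
    } else {
        pairSumAll<<<blocks, threads>>>(dA, dSum, n);
    }

    // The blocking copy waits for the kernels on the default stream.
    cudaMemcpy(hSum, dSum, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dSum);
}
```

Both variants should produce the same sum[] contents. No two threads write the same location, so neither needs atomics.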

I wouldn’t expect implementation2 to be faster. You’re advancing a claim without any support, and asking me or someone else to explain it.

Let me try that with you:

“Why is the world flat?”

If you asked me that question, I would say “The world is not flat. How did you come to that conclusion, which observations and measurements did you make?”

You’ve given no observations (except the very highest one) and no measurements.
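
For reference, one common way to collect such measurements is with CUDA events. The following is only a sketch under assumptions: the kernel pairSumTimed and the helper timePairSumMs are invented stand-ins for whatever kernel is being measured, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel to be timed: thread i writes sum[i] = A[i+1] + A[i].
__global__ void pairSumTimed(const int *A, int *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) sum[i] = A[i + 1] + A[i];
}

// Hypothetical helper: returns the average time per launch in milliseconds.
// Requires n >= 2 and repeats >= 1.
float timePairSumMs(int n, int repeats)
{
    int *dA = nullptr;
    int *dSum = nullptr;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dSum, (n - 1) * sizeof(int));
    cudaMemset(dA, 0, n * sizeof(int));

    int threads = 256;
    int blocks = (n - 1 + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not measured.
    pairSumTimed<<<blocks, threads>>>(dA, dSum, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r) {
        pairSumTimed<<<blocks, threads>>>(dA, dSum, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches are done

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dSum);
    return totalMs / repeats;
}
```

Timing each implementation this way, with the same data and launch configuration reported alongside the numbers, gives others something concrete to reason about.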

user144648 · May 1, 2023, 6:09am

I tried to reply, but my post was marked as spam by the forum.
I don’t know why.

Thank you for your answer.
I’ll try to explain my code.
Unfortunately I can’t post it, but I’ll try to make it clear with some sample code.

I have an array A of size 3000x3000.
I store A as a one-dimensional array, but below I show it in two dimensions (for convenience).
Let A be:
A = [
00 01 02 03 04…
05 06 07 08 09…
10 11 12 13 14…
15 16 17 18 19…
20 21 22 23 24
…
25 26 27 28 29…
]
I have an array B with the same dimensions.
B stores the sum of the n-neighboring elements.
For n=1, B stores the following:
B[2,2] =
A[1,1] + A[1,2] + A[1,3] +
A[2,1] + A[2,2] + A[2,3] +
A[3,1] + A[3,2] + A[3,3]
For the border elements of A I pad with zero values,
but this doesn’t matter here.
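
The neighborhood sum described above could be sketched roughly as follows. This is not the real code (which can't be posted): the names boxSum and runBoxSum are invented, int elements are an assumption, and out-of-range neighbors are skipped instead of reading from a zero-padded array, which gives the same sums.

```cuda
#include <cuda_runtime.h>

// One thread per output element: B[r,c] is the sum of all A elements
// within n rows and n columns of (r,c). Neighbors outside the array
// count as zero.
__global__ void boxSum(const int *A, int *B, int rows, int cols, int n)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r >= rows || c >= cols) return;

    int acc = 0;
    for (int dr = -n; dr <= n; ++dr) {
        for (int dc = -n; dc <= n; ++dc) {
            int rr = r + dr;
            int cc = c + dc;
            if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                acc += A[rr * cols + cc];  // A is stored row-major in 1D
            }
        }
    }
    B[r * cols + c] = acc;
}

// Hypothetical host helper: hA and hB each hold rows * cols ints.
void runBoxSum(const int *hA, int *hB, int rows, int cols, int n)
{
    size_t bytes = (size_t)rows * cols * sizeof(int);
    int *dA = nullptr;
    int *dB = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    boxSum<<<grid, block>>>(dA, dB, rows, cols, n);

    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(hB, dB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}
```

Neighboring threads read overlapping elements of A, but each thread writes only its own element of B.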

I have two implementations:
The first implementation has only one kernel, which performs the above operations.
Each thread calculates one of the above sums:
kernel threads:
thread0: sum for A[0,0]
thread1: sum for A[0,1]
thread2: sum for A[0,2]
…
thread3: sum for A[1,0]
thread4: sum for A[1,1]
…
etc

The second implementation has two kernels.
The first kernel calculates the offset of each thread
so that each element is n elements away from the others.
The second kernel computes the above sums.
kernel2 threads:
thread0: sum for A[0,0]
thread1: sum for A[0,3]
…
thread3: sum for A[3,0]
thread4: sum for A[3,3]
…
etc
Next, Kernel1 changes the offset so that the kernel2 threads compute the following sums:
thread0: sum for A[0,1]
thread1: sum for A[0,4]
…
thread3: sum for A[3,1]
thread4: sum for A[3,4]
…
etc

Now, I have the following throughput results on compute capability 5.2 for A = 3000x3000 and n=3.
kernel1: 40M elements per second
kernel2: 70M elements per second
The time measures only the execution of each kernel.
All kernels do the same total number of iterations and produce the same result.
In my real code I do some other operations, but I scan the same elements as in the example code above.
I made an array C equal to A and computed the sums of each A element with the neighboring B elements, like this:
B[2,2] =
B[1,1] + B[1,2] + B[1,3] +
B[2,1] + A[2,2] + B[2,3] +
B[3,1] + B[3,2] + B[3,3]
I get the same results.

It’s as if I have a conflict in global memory (which contradicts the theory),
or the first kernel, which broadcasts element values when calculating e.g. A[0,] and A[0,1], is slower than the second implementation with the offset.
