Difference in "conflicts" between CUDA and other architectures? Memory access conflicts

Dear all,
I think there are 2 possible memory write conflicts:
1, inter-thread: many threads in a block write to the same shared memory location
2, inter-block: threads from many blocks write to the same global memory location
Could you give some suggestions on how to handle these 2 conflicts?
I can think of:
1, write the code so that there are no conflicts. But this can’t be done in the “histogram” case.
2, use a semaphore or something similar. But this needs atomic operations, which CUDA lacks.
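
To make the histogram case concrete, here is a minimal sketch of the kind of conflict I mean. The names `naive_histogram`, `reference_histogram` and `NUM_BINS` are made up for illustration, and the input is just random bytes. The kernel increments a bin in global memory without any synchronization, so updates from different threads (in the same block or in different blocks) can overwrite each other.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 16  // hypothetical bin count, chosen only for illustration

// Naive histogram: each thread reads one input element and bumps its bin.
// The increment is an unsynchronized read-modify-write on global memory,
// so two threads hitting the same bin can lose one of the updates.
__global__ void naive_histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bins[data[i] % NUM_BINS] += 1;  // this line is the write conflict
    }
}

// Sequential reference on the host: one writer, so no conflicts.
void reference_histogram(const unsigned char *data, int n, unsigned int *bins)
{
    for (int i = 0; i < n; ++i) {
        bins[data[i] % NUM_BINS] += 1;
    }
}

int main()
{
    const int n = 1 << 20;
    unsigned char *h_data = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) {
        h_data[i] = (unsigned char)(rand() % 256);
    }

    unsigned int h_ref[NUM_BINS] = {0};
    reference_histogram(h_data, n, h_ref);

    unsigned char *d_data = NULL;
    unsigned int *d_bins = NULL;
    cudaMalloc((void **)&d_data, n);
    cudaMalloc((void **)&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    naive_histogram<<<blocks, threads>>>(d_data, n, d_bins);

    unsigned int h_bins[NUM_BINS];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);

    // Compare the total number of counted elements on both sides.
    unsigned long long ref_total = 0, gpu_total = 0;
    for (int b = 0; b < NUM_BINS; ++b) {
        ref_total += h_ref[b];
        gpu_total += h_bins[b];
    }
    printf("reference total: %llu, GPU total: %llu\n", ref_total, gpu_total);

    cudaFree(d_data);
    cudaFree(d_bins);
    free(h_data);
    return 0;
}
```

I would expect the GPU total to come out lower than the reference total whenever updates are lost, though how many are lost presumably depends on the hardware and on timing.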

In fact, I’m still not clear on how other parallel architectures handle conflicts. So my questions are:
1, What is the structural difference between CUDA and other architectures, and what makes CUDA’s memories special?
2, How does CUDA’s conflict handling differ from the conflict handling on existing parallel architectures, regardless of whether the mechanism is in hardware or in software?

Thank you very much!