and I let row = blockDim.x * blockIdx.x + threadIdx.x, with no y coordinate. So when CUDA computes x[row], does that mean every thread in the same row performs this calculation? If so, different threads would access the same data; would that cause any conflict?
Also, could different combinations of blockDim.x, blockIdx.x and threadIdx.x lead to the same value of row?
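
To make the second question concrete, here is a small host-side C++ sketch (plain C++, not an actual kernel) that enumerates every (blockIdx.x, threadIdx.x) pair of a 1D launch and counts how many of them repeat an already-seen row. The function names global_row and count_duplicate_rows are made up for illustration, and the sketch assumes what I understand to hold within a single launch: blockDim.x is the same for every block and threadIdx.x is always smaller than blockDim.x.

```cpp
#include <set>

// Host-side stand-in for the CUDA expression
//   row = blockDim.x * blockIdx.x + threadIdx.x
// (hypothetical helper, not part of the CUDA API).
int global_row(int block_dim, int block_idx, int thread_idx) {
    return block_dim * block_idx + thread_idx;
}

// Walks over every (blockIdx.x, threadIdx.x) pair of a 1D launch with
// grid_dim blocks of block_dim threads each, and counts the pairs whose
// row value was already produced by an earlier pair.
// Assumption: thread_idx ranges over 0 .. block_dim - 1, as in a real launch.
int count_duplicate_rows(int grid_dim, int block_dim) {
    std::set<int> seen;
    int duplicates = 0;
    for (int b = 0; b < grid_dim; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            // insert().second is false when the row is already in the set.
            if (!seen.insert(global_row(block_dim, b, t)).second) {
                ++duplicates;
            }
        }
    }
    return duplicates;
}
```

Calling count_duplicate_rows(3, 4), for example, walks the 12 threads of a launch with 3 blocks of 4 threads. If the count is zero, each thread gets its own row within that launch, which is what I would expect from the formula, but I would like to confirm that this matches how CUDA actually behaves.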