Coalesced Access: Is this coalesced?

Say I have a 1D array of size N and I want to shift the data by one entry.

idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx<N-1) data_out[idx] = data_in[idx+1];

Is this coalesced read for data_in?

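If data_in starts on an aligned boundary (as memory from cudaMalloc does), then it’s probably not: thread 0 reads data_in[1], so each half-warp’s read starts one element past the boundary. You may want to try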

if(idx >= 1 && idx < N) data_out[idx-1] = data_in[idx];

Of course, this way you no longer have a coalesced write. If you really want both a coalesced read and a coalesced write, you can use shared memory to buffer the data.

Data are of float type.

How could it be possible using Shared Memory?

I still have to offset the index of data_in or data_out by 1.

For coalescing, thread 0 must read/write position 0, thread 1 must read/write position 1, and so on up to thread 15. This pattern repeats every 16 threads (one half-warp).

Hence, use thread 0 to read array[0] into shared memory, thread 1 reads array[1], etc. For 16 threads:

shared[threadIdx.x] = in_array[threadIdx.x];


outArray[threadIdx.x +/- 1] = shared[threadIdx.x]

This way, the threads read directly but write at an offset, except that thread 0 or the very last thread doesn’t participate.

This is a quick hint - does this make sense? I can explain it better if required.

I think this is something you could try: have each thread i load its corresponding float i into shared memory at index i, then have thread i read shared element i+1 and write it to global memory at index i.

This way both reads and writes are coalesced.
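A minimal sketch of the scheme just described, assuming the whole array fits in a single thread block. The names shiftLeftSingleBlock and shiftLeftSingleBlockHost are made up for illustration, and the 512-element limit is an arbitrary conservative choice. Note the __syncthreads() between the load and the store, which the one-line snippets in this thread leave out.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: assumes ONE block with blockDim.x >= n and
// n * sizeof(float) bytes of dynamic shared memory.
__global__ void shiftLeftSingleBlock(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    int i = threadIdx.x;

    // Thread i loads element i: consecutive threads, consecutive addresses.
    if (i < n) tile[i] = in[i];

    // Every load must be finished before any thread reads its neighbor's slot.
    __syncthreads();

    // Thread i writes out[i]; the +1 offset is applied in shared memory only.
    if (i < n - 1) out[i] = tile[i + 1];
}

// Hypothetical host helper: copies hIn to the device, runs one block and
// copies the n-1 shifted elements back into hOut. Returns true on success.
bool shiftLeftSingleBlockHost(const float *hIn, float *hOut, int n)
{
    if (n < 2 || n > 512) return false;  // assumed single-block limit

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr;
    float *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaError_t err = cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        shiftLeftSingleBlock<<<1, n, bytes>>>(dIn, dOut, n);
        // This blocking copy also reports errors from the kernel launch.
        err = cudaMemcpy(hOut, dOut, (n - 1) * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Calling shiftLeftSingleBlockHost(in, out, n) fills out[0..n-2] with in[1..n-1] and leaves out[n-1] untouched. Whether this beats the plain offset read is something to measure on the target hardware.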

I have no idea if it’s going to be faster than non-coalesced reads and/or writes.

And it will only work as is if all your data can fit into one thread block’s shared memory (16 KB), or else you will have to deal with boundary conditions.

As for the original question, as was answered, it is not coalesced.

And as a disclaimer, all that I have said may well be completely wrong, feel free to correct me!

edit: looks like kristleifur beat me to the punch while I was writing this. Though I don’t agree with his last line of code!

I agree - but IMO,

The boundary conditions will have to be handled, but they won’t be that bad.

You don’t have to look that much into how much fits into shared.
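One way the boundary handling could look, as a sketch rather than a tuned implementation: each block stages its own tile in shared memory, and the last thread of the block also fetches the first element of the next tile. The names shiftLeftTiled and shiftLeftTiledHost and the tile size of 256 are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 256;  // assumed block size; the launch must use blockDim.x == TILE

// Hypothetical multi-block kernel: each block stages TILE elements plus one
// extra element from the next tile, so shared memory use is fixed per block.
__global__ void shiftLeftTiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 1];
    int t = threadIdx.x;
    int idx = blockIdx.x * TILE + t;

    // Main load: thread t reads element idx (no offset in global memory).
    if (idx < n) tile[t] = in[idx];

    // Boundary: the last thread of the block also reads the first element
    // of the next tile, costing one extra single-element read per block.
    if (t == TILE - 1 && idx + 1 < n) tile[TILE] = in[idx + 1];

    __syncthreads();

    // Store: thread t writes element idx; the shift happens in shared memory.
    if (idx < n - 1) out[idx] = tile[t + 1];
}

// Hypothetical host helper: returns true on success; hOut[n-1] is untouched.
bool shiftLeftTiledHost(const float *hIn, float *hOut, int n)
{
    if (n < 2) return false;

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr;
    float *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaError_t err = cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int blocks = (n + TILE - 1) / TILE;  // round up to cover the tail
        shiftLeftTiled<<<blocks, TILE>>>(dIn, dOut, n);
        // This blocking copy also reports errors from the kernel launch.
        err = cudaMemcpy(hOut, dOut, (n - 1) * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Because each block only ever holds TILE + 1 floats, the total array size no longer has to fit in shared memory; only the tile does.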

Btw, the “transpose” example is along similar lines. It rotates a 2D array instead of shifting a 1D one, but it’s similar in spirit.


outArray[threadIdx.x +/- 1] = shared[threadIdx.x]

Is it coalesced for threadIdx.x +/- 1? (Well, this question again…)

I understand the “transpose” matrix example, but this shifting problem adds an offset.

Hm… I think the idea of adding a non-participating thread would work.

No! :D You’re right. Sorry!

outArray[threadIdx.x] = shared[threadIdx.x +/- 1]