Coalesced Access: Is this coalesced?

ribbery · June 16, 2008, 4:03pm

Say I have a 1D array of size N and I want to shift the data by 1 entry.

idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx<N-1) data_out[idx] = data_in[idx+1];

Is this a coalesced read for data_in?

pcchen · June 16, 2008, 4:58pm

If data_in is aligned on 16 bytes then it’s probably not. You may want to try

if(idx >= 1 && idx < N) data_out[idx-1] = data_in[idx];

Of course, this way now you don’t have coalesced write. If you really want to have both coalesced read and coalesced write, you can use shared memory to buffer it.

ribbery · June 16, 2008, 5:16pm

Data are of float type.

How could it be possible using Shared Memory?

I still have to offset the index of data_in or data_out by 1.

kristleifur · June 16, 2008, 5:54pm

For coalescing, thread 0 must read/write position 0, thread 1 must read/write position 1, and so on up to thread 15. The pattern repeats every 16 threads (each half-warp).

Hence, use thread 0 to read array[0] into shared memory, thread 1 reads array[1], etc. For 16 threads:

shared[threadIdx.x] = in_array[threadIdx.x];

__syncthreads();

outArray[threadIdx.x +/- 1] = shared[threadIdx.x]

This way, the threads read directly but write at an offset, except that thread 0 or the very last thread doesn’t participate.

This is a quick hint - does this make sense? I can explain it better if required.

Ailleur · June 16, 2008, 5:55pm

I think this is something you could try: have each thread i load its corresponding float i into shared memory at index i, and then have thread i read shared data i+1 and write it to global memory index i.

This way both reads and writes are coalesced.

I have no idea if it’s gonna be faster than non-coalesced reads and/or writes.

And it will only work as-is if all your data fits into one thread block’s shared memory (16 KB); otherwise you will have to deal with boundary conditions.

As for the original question, as was answered, it is not coalesced.

And as a disclaimer, all that I have said may well be completely wrong, feel free to correct me!

edit: looks like kristleifur beat me to the punch while I was writing this. Though I don’t agree with his last line of code!

kristleifur · June 16, 2008, 5:57pm

I agree - but IMO,

The boundary conditions will have to be handled, but they won’t be that bad.

You don’t have to look that much into how much fits into shared.

Btw, the “transpose” example is along similar lines. It rotates a 2D array instead of shifting a 1D one, but it’s similar in spirit.

ribbery · June 16, 2008, 7:41pm

kristleifur,

outArray[threadIdx.x +/- 1] = shared[threadIdx.x]

Is it coalesced for threadIdx.x +/- 1? (Well, this question again… :blink: )

I understand the “transpose” matrix example, but this shifting problem adds an offset.

hm… I think the idea of adding a non-participating thread would work.

kristleifur · June 17, 2008, 9:17am

No! :D You’re right. Sorry!

outArray[threadIdx.x] = shared[threadIdx.x +/- 1]
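
Putting the thread's pieces together, the approach might look roughly like the sketch below. It is not code from any of the posters. The names shift_left, shift_left_on_device, tile and BLOCK are made up for illustration, and the block size of 256 is an assumption. Each thread loads element idx into shared memory and stores to element idx, so both global accesses stay aligned and the offset of 1 is applied only to the shared-memory index. The boundary between blocks is handled by letting thread 0 fetch one extra element (the first element of the next block), which costs one additional read per block.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; an assumed value, the kernel needs blockDim.x <= BLOCK

// Shift left by one entry: data_out[i] = data_in[i + 1] for i < n - 1.
__global__ void shift_left(const float* data_in, float* data_out, int n)
{
    __shared__ float tile[BLOCK + 1];  // one extra slot for the next block's first element

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Aligned load: thread i reads global element i into shared slot i.
    if (idx < n) tile[threadIdx.x] = data_in[idx];

    // Boundary: thread 0 also fetches the first element of the next block.
    if (threadIdx.x == 0) {
        int halo = (blockIdx.x + 1) * blockDim.x;
        if (halo < n) tile[blockDim.x] = data_in[halo];
    }

    __syncthreads();  // every thread reaches this; it is outside the conditionals

    // Aligned store: thread i writes global element i, reading shared slot i + 1.
    if (idx < n - 1) data_out[idx] = tile[threadIdx.x + 1];
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and returns the shifted array. The last output element is left at 0.
std::vector<float> shift_left_on_device(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    int blocks = (n + BLOCK - 1) / BLOCK;
    shift_left<<<blocks, BLOCK>>>(d_in, d_out, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling shift_left_on_device on an array holding 0, 1, 2, ... should give back 1, 2, 3, ... in every position except the last. Error checking on the CUDA calls is left out to keep the sketch short. Whether this is actually faster than the plain offset read is something to measure on your own hardware, as Ailleur notes above.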