Coalesced shared memory access? Read and write from which thread to which?

Hello, I am wondering whether I have coalesced memory access in my code.

But first, something to clarify.

If I have

myshared[threadIdx.x] = myglobal[threadIdx.x]

then, OK, it is coalesced.
If I have

myshared[threadIdx.x] = myglobal[threadIdx.x + some_constant_value]

it is coalesced, I think.
If I have

myshared[threadIdx.x + some_constant_value] = myglobal[threadIdx.x]

it is coalesced, I think.

Now, if I have:

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x )

myshared[threadIdx.x] = myglobal[ind]

I think it is OK.

myshared[threadIdx.x + 1] = myglobal[ind]

What about this one?

And finally, if I have my global data and I want to access different values.

For example, I want a pixel in the centre of a grid:

myshared[threadIdx.x + 1] = myglobal[ind]

Now I want to compute the value to its left, so:

Left_ind = ind - 1

myshared[threadIdx.x] = myglobal[Left_ind]

And the value to its right:

Right_ind = ind + 1

myshared[threadIdx.x + 2] = myglobal[Right_ind]
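
Putting the above together, here is a minimal sketch of the pattern I mean (assuming, just for illustration, a 1D thread block of 16 threads, one image row per block row, and ignoring what happens at the image borders):

__global__ void load_with_neighbours( const float *myglobal, int NumberOfCols )
{
    __shared__ float myshared[16 + 2];               // centre values plus one halo slot on each side

    int ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols
            + ( threadIdx.x + blockIdx.x * blockDim.x );

    myshared[threadIdx.x + 1] = myglobal[ind];       // centre
    myshared[threadIdx.x]     = myglobal[ind - 1];   // left value  (Left_ind)
    myshared[threadIdx.x + 2] = myglobal[ind + 1];   // right value (Right_ind)

    __syncthreads();
}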

Lastly, here is the thing I can’t understand. For example, here:

myshared[threadIdx.x +1] = myglobal[ind]

Since all threads are running concurrently, myshared is going to have:

myshared[0 +1]
myshared[1 +1]
myshared[2 +1]
...

and, all at the same time, myglobal will have:

myglobal[0]
myglobal[1]
myglobal[2]
....

how can I assure that myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example?
Or the opposite with myshared.

Am I missing something here?

Thank you!

I would interpret,

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x)

As:

ind = (index_A + offset_A) * offset_C + (index_B + offset_B)

index_A + offset_A should be coalesced
index_B + offset_B should be coalesced
index_A + offset_A + index_B + offset_B should be semi-coalesced
But I doubt whether (index_A + offset_A) * offset_C + (index_B + offset_B) is coalesced
Or, quoting you, am I missing something here?

“how can I assure that myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example?”

Shouldn’t you calculate and store ind per thread - this would mean ind would differ per thread…?

myshared[i] = myglobal[ind[i]] for example…?

I am sorry, but you lost me…
Could you explain a little more about what I am asking?

Also, I didn’t understand the

myglobal[ind[i]]

that you wrote.
Since ind holds a value for every thread, are you saying to have ind as an array?

Thanks!

Well, you firstly asked whether,

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x)

would be coalesced

I replied that I doubt that it is - the spacing between elements should be proportional to numberofcols if I am not mistaken

Secondly, you asked:
“how can I assure that myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example?”

I replied that (given the very definition/ formula of ind,) ind would likely differ for each thread; hence each thread would calculate and store its own ind value, I would think
If this is so, each thread would store ind either as a local or shared variable; in the latter case, it would probably imply something like ind[i]
If each thread uses its ind to access global memory, and its ind value is stored in shared rather than local memory, you would read global memory as myshared[i] = myglobal[ind[i]]
Put differently, ind declared as a local variable should be the same as ind stored in shared memory, per thread - e.g. ind[i] (ind == ind[i])
Do all threads (in a block) use the same ind, or does each thread calculate its own ind? I would think the latter.
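
To illustrate (just a sketch, assuming for simplicity a one-dimensional 16-thread block at blockIdx = (0, 0) and NumberOfCols = 64): ind is an ordinary variable declared inside the kernel, so every thread holds its own private copy, and the copies differ because threadIdx differs:

thread (tx = 0, ty = 0): ind = 0 * 64 + 0 = 0 -> reads myglobal[0]
thread (tx = 1, ty = 0): ind = 0 * 64 + 1 = 1 -> reads myglobal[1]
thread (tx = 2, ty = 0): ind = 0 * 64 + 2 = 2 -> reads myglobal[2]
...

So thread k always reads myglobal[k] for its row; no thread ever picks up some other thread's element by accident.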

Perhaps you should be more specific as to why you think “myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example”

Perhaps you should be more specific as to why you think "myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example"

I mean that when the kernel is being executed, all the threads run in parallel (concurrently), hence ind will contain all of its values at the same time (0, 1, 5, 2, 8, 12, 7…).

The same applies to myshared ( myshared[threadIdx.x + 1] ).

So, since all threads are executed in “arbitrary” order (in parallel), how do I know that
myglobal[0] will be assigned to myshared[0] and not something else, such as myglobal[7] = myshared[3]?

As for the

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x)

it is used in every matrix multiplication example and it is fully coalesced.

And my main problem is how to achieve coalesced access when I want to compute the left and right values I referred to at the beginning of the post.

“…it is used in every matrix multiplication example (and it is fully coalesced.)”
I agree with the former, but not necessarily with the latter
Nevertheless, I have no problem admitting when I am wrong; so, stipulate the thread block dimensions, and specify the value of numberofcols so that one can substitute the thread block and thread grid dimensions, to see if it is coalesced

“So, since all threads are executed in “arbitrary” order (in parallel), how do I know that
myglobal[0] will be assigned to myshared[0] and not something else, such as myglobal[7] = myshared[3]?”

In my view, based on the formula of ind, each thread would have a unique ind value - it would either be coalesced or non-coalesced - but each thread would map to its own global memory address/ variable
Again, substitute block/ grid dimension values per thread for ind to verify this
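
For instance, a quick host-side check might look something like this (only a sketch; the block index, block dimensions and NumberOfCols value are the ones mentioned elsewhere in this thread):

#include <stdio.h>

int main( void )
{
    const int NumberOfCols = 64;                  // number of columns in the image
    const int blockDimX = 16, blockDimY = 16;     // 256 threads per block
    const int blockIdxX = 3,  blockIdxY = 4;      // pick any block

    // print ind for a few adjacent threads (same threadIdx.y, increasing threadIdx.x)
    for ( int tx = 0; tx < 4; tx++ )
    {
        const int ty  = 0;
        const int ind = ( ty + blockIdxY * blockDimY ) * NumberOfCols
                      + ( tx + blockIdxX * blockDimX );
        printf( "tx = %d -> ind = %d\n", tx, ind );
    }
    return 0;
}

Adjacent values of tx producing adjacent values of ind is what you want to see.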

Hello,

I didn’t say that you were wrong, I am just saying what I know so far! :)

I didn’t understand what you said about substituting. You mean which numbers I use? Does it matter? Won’t

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x)

always be coalesced?

I am using the dimensions of a pixel as: rows = 80, cols = 64.

I am using 256 threads/block and 20 blocks ( 4 in x and 5 in y).

But generally, because I want to process more than one image, I am doing:

__shared__ int myshared[tile_width][tile_width + 2];

    int bx = blockIdx.x , by = blockIdx.y;
    int tx = threadIdx.x , ty = threadIdx.y;

    int RowIdx = ty + by * tile_width;
    int ColIdx = tx + bx * tile_width;

and then looping:

    for (int i = 0; i < N; i++ ) {   // N is the number of images

        __syncthreads();

        J = RowIdx * Cols + ColIdx + Rows * Cols * i;

        myshared[ty][tx + 1] = *( dev_input + J );

        // taking boundary conditions into account
        if ( tx == 0 )              myshared[ty][tx]   = *( dev_input + (J-1) );
        if ( tx == tile_width - 1 ) myshared[ty][tx+2] = *( dev_input + (J+1) );

        __syncthreads();

and the launch configuration is:

    const int tile_width = 16;

    dim3 dimGrid( (Cols / tile_width) , (Rows / tile_width) , N );
    dim3 dimBlock( tile_width , tile_width );

So, (J-1) refers to the left pixel, (J+1) to the right, and J to the centre.

I used [ty][tx+1] for the centre pixel, [ty][tx] for the left and [ty][tx+2] for the right.

I want to know whether the above is coalesced or not, and I want to understand why!

Thank you very much!

So, taking for example thread nr 254 and thread nr 255 of block 19

thread 254, block 19:
threadIdx.x = 14; threadIdx.y = 15; blockIdx.x = 3; blockIdx.y = 4

ind = (15 + 4 * 16) * (NumberofCols) + (14 + 3 * 16)
ind = (79) * (NumberofCols) + (62)

thread 255, block 19:
threadIdx.x = 15; threadIdx.y = 15; blockIdx.x = 3; blockIdx.y = 4

ind = (15 + 4 * 16) * (NumberofCols) + (15 + 3 * 16)
ind = (79) * (NumberofCols) + (63)

Seems coalesced, my fault

OK, I saw your edit.
The number of columns is 64 and the number of rows is 80, as I wrote.

So, coalescing in simple words means that adjacent threads should access adjacent elements?

And so, to see whether these:

myshared[ty][tx + 1] = *( dev_input + J );
myshared[ty][tx] = *( dev_input + (J-1) );
myshared[ty][tx+2] = *( dev_input + (J+1) );

are coalesced, is that the check I have to do?

And how do I do this correctly for 2 dimensions (myshared[ty][tx])?

Thanks!

How have you declared J, and how have you declared dev_input?

typedef struct
{
    float Re,
          Im;
} mycomplex;

int J;
const mycomplex * const dev_input;

J = RowIdx * Cols + ColIdx + Rows * Cols * i;

So J is stored in local memory, and dev_input seems to be an offset used with pixel data in global memory

Based on the way that pixel data is stored, it seems that all pixel data access would be coalesced, regardless of the pixel being accessed

Regarding your boundary conditions, what bothers me, and something you would perhaps have to check, is this: J - 1 and J + 1 may be meant to refer to the left and right pixels, but I doubt whether the actual addresses of the left and right pixels are as simple as merely J - 1 and J + 1
Rather, you would probably have to shift the address by a pixel block - J - pixel_block; J + pixel_block
Just verify this

Also, it seems as if you wish to store multiple pixels, or store pixels separately, when hitting boundary conditions
Then, __shared__ int myshared[tile_width][tile_width + 2]; with [ty][tx+1] for the centre pixel, [ty][tx] for the left and [ty][tx+2] for the right,
would likely also not suffice
If a pixel fits into [tile_width][tile_width], you would need [tile_width * 3][tile_width * 3] to store left, right and center separately; you would then also reference as [tx + tile_width * (0, 1, 2)], [ty + tile_width * (0, 1, 2)], I would think
Also, just check this

Hello and thanks for the help. (Sorry for the late reply.)

So, for boundary conditions, will the below be OK?

(J-1) is J_
(J+1) is Jp
Cols is the number of columns

if ( 0 == ColIdx ) J_ += Cols;

if ( ColIdx == Cols - 1 ) Jp -= Cols;

The problem with defining

__shared__ int myshared[tile_width * 3][tile_width * 3];

(with tile_width = 16) is that it uses too much shared memory (and it throws an error).

And are you sure about that? Because I want to use one tile width, and only when I want to take the pixel to the left or right do I decrease or increase by 1 pixel.
If I use tile_width * 3 it means that I want to use an entire tile width for the centre, one for the left and one for the right.
But since I am scanning pixel by pixel, I think that is wrong?

Perhaps you should throw out the 2d array, and work with a 1d array instead
You are likely to find that you can calculate pixel address offsets much more easily that way

If a pixel is 16 x 16, a new pixel’s address would start at multiples of 256 - 0, 256, 512…
A new row of a pixel would then start at multiples of 16 - 0, 16, 32, 48…
Columns are everything inbetween - 0 - 15; 16 - 31; …
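
As a small sketch of that addressing (assuming, as above, a 16 x 16 pixel stored contiguously in row-major order; the helper name is made up):

int pixel_addr( int pixel, int row, int col )
{
    const int pixel_dim = 16;
    return pixel * pixel_dim * pixel_dim   // a new pixel starts at multiples of 256: 0, 256, 512, ...
         + row   * pixel_dim               // a new row of that pixel starts at multiples of 16: 0, 16, 32, ...
         + col;                            // columns are everything in between: 0 .. 15
}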

If I interpret your J_ and JP correctly, you are merely shifting the row of the same pixel, instead of shifting by a whole pixel

It is not clear whether you want to be able to store left, right, center simultaneously, or only one of left, right, center at a time
For the former you would need space for 3 pixels (tile_width * tile_width * 3), and if you run out of shared memory, you may need to redesign your problem/ solution
In the latter case, you only need space for 1 pixel (tile_width * tile_width), and you do not need to offset your shared memory addresses, like you have done here:
myshared[ty][tx + 1], myshared[ty][tx], myshared[ty][tx + 2]
You may hit border cases only occasionally, but at that time you would need 3 * pixel space, if you wish to store left, right, center simultaneously, or at least 2 * pixel space, to store center and left, or center and right simultaneously

I have an image which is 80 x 64 pixels (rows x cols). (I do not have a 16 x 16 pixel.)

Inside that area I am scanning pixel by pixel.

So, if I go outside that area (or, to be precise, outside the tile_width area), I am then dealing with boundary conditions. So, I am scanning (storing) only one pixel at a time.

You say that I don’t need to offset my shared memory, but I can’t understand that. I must do some kind of offset, since when I hit the boundaries I am going outside the tile_width area.

I guess I got lost

So, to be clear:
I take it that you wish to process an (one) image at a time?
And the boundary conditions pertain to the boundaries of the image, or to each pixel within the image?
The image seems too large to fit 1 thread block; I recall you mentioned 20 blocks, with 256 threads per block. So, what is the dimension of a pixel again?
If the dimension of a pixel is merely 1 x 1, why do you use 256 threads per block, and 20 blocks?

Ok,

Yes, I am processing one image at a time, and using the for loop (previous post, from i = 0; i < N;) I am doing the same job for N images.

The boundary conditions pertain to the boundaries of the tile width for one image. I have one image of size 80 x 64 pixels and I am dividing it into tile_width (16) sized blocks (4 x 5), and I am processing each tile-width block. If a pixel goes outside this block (left or right, up or down) then I apply the boundary conditions (as I wrote above, only for left and right).

So, 4 blocks in x => 4 * 16 = 64 (cols) and 5 blocks in y => 5 * 16 = 80 (rows).
That’s why I am using 256 threads per block.

const int tile_width = 16;

dim3 dimGrid( (Cols / tile_width) , (Rows / tile_width) , N );
dim3 dimBlock( tile_width , tile_width );

I hope it is clear now?

Thank you for helping.

I believe we are making progress

Before your thread block can start processing a tile block, it needs to read the particular tile block of image n (0 - N - 1) from global memory, I presume

Now, how is your image and its pixels stored in global memory - simply by pixel row (or column) major, I again presume…?

[P = Pixel]; [R = Row]; [C = Col]

P(R1, C1), P(R1, C2), P(R1, C3), … P(R1, C64), P(R2, C1), P(R2, C2), P(R2, C3), … P(R2, C64),
P(R3, C1), P(R3, C2), P(R3, C3), … P(R3, C64),…
P(R80, C1), P(R80, C2), P(R80, C3), … P(R80, C64)

Like so?
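
If so, the linear index of pixel (r, c) of image n would presumably be something like this (just a sketch; pixel_index is a made-up helper, using your Rows, Cols and N names and counting rows and columns from 0):

int pixel_index( int n, int r, int c, int Rows, int Cols )
{
    // image n comes after n full images; within an image, row r comes after r full rows
    return n * Rows * Cols + r * Cols + c;
}

which is the same thing as your J = RowIdx * Cols + ColIdx + Rows * Cols * i.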

Running with the above mentioned assumption that the pixel and thus tile data is stored in global memory as row-major:

The following should be sufficient to store a tile, with boundaries, in shared memory:

tile_in_shared[tile_width * tile_width]
boundaries_in_shared[tile_width * 2] // for left and right only, not top/ bottom

or you could amalgamate:
tile_with_boundaries_in_shared[tile_width * (tile_width + 2)]

To read a tile from global then:

tile_gbl_offsetA = blockIdx.y * tile_width * columns;     // first image row covered by this tile
tile_gbl_offsetB = blockIdx.x * tile_width;                // first image column covered by this tile
tile_gbl_offset = tile_gbl_offsetA + tile_gbl_offsetB;     // top-left element of the tile in global memory
tile_gbl_index = (threadIdx.y * columns) + threadIdx.x;    // this thread's element, relative to the tile's top-left

tile_shared_offset = threadIdx.y * tile_width;
tile_shared_index = threadIdx.x;

tile_in_shared[tile_shared_offset + tile_shared_index] = image_in_gbl[tile_gbl_offset + tile_gbl_index];
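
(If you went the amalgamated route instead, the store would presumably become something like, again just a sketch:

tile_with_boundaries_in_shared[(threadIdx.y * (tile_width + 2)) + tile_shared_index + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index];

with the left and right halo elements of each row going to index threadIdx.y * (tile_width + 2) and threadIdx.y * (tile_width + 2) + tile_width + 1, respectively.)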

boundaries:

if (blockIdx.x == 0)
{
    if (threadIdx.x == 0)
    {
        // left
        boundary_in_shared[threadIdx.y * 2] = image_in_gbl[tile_gbl_offset + tile_gbl_index + (columns - 1)];
        // right
        boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index + tile_width];
    }
}
else if (blockIdx.x == 3)
{
    if (threadIdx.x == 0)
    {
        // left
        boundary_in_shared[threadIdx.y * 2] = image_in_gbl[tile_gbl_offset + tile_gbl_index - 1];
        // right
        boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index - (columns - 1)];
    }
}
else
{
    if (threadIdx.x == 0)
    {
        // left
        boundary_in_shared[threadIdx.y * 2] = image_in_gbl[tile_gbl_offset + tile_gbl_index - 1];
        // right
        boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index + tile_width];
    }
}

Just check
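
Once the tile and its boundaries are in shared memory (and after a __syncthreads()), reading the left and right neighbours of a thread's own element might then look something like this; again only a sketch with the same names, using float purely for illustration:

__syncthreads();

// left neighbour: for the first column of the tile it comes from the boundary array
float left  = (threadIdx.x == 0)
            ? boundary_in_shared[threadIdx.y * 2]
            : tile_in_shared[tile_shared_offset + tile_shared_index - 1];

// right neighbour: for the last column of the tile it comes from the boundary array
float right = (threadIdx.x == tile_width - 1)
            ? boundary_in_shared[(threadIdx.y * 2) + 1]
            : tile_in_shared[tile_shared_offset + tile_shared_index + 1];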

Hello,

1. I am a bit lost with all the offsets and the way you compute the left and right pixels.

Isn’t there a way to do this with my approach? (In order to understand it better.)

2. I tried to run the code using the above, but (after running another kernel, not the one that has all of the above) it gives me “an illegal memory access was encountered”.

Thank you!

I also found this: http://www.bu.edu/pasi/files/2011/07/Lecture31.pdf
I am trying to understand it!