I’m not an expert on shared memory issues, but I think there is room in your code to make things better.

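First, shared memory is shared among all threads of a block, so in my opinion having every thread initialize the matrix makes no sense. It may be more sensible to wrap the matrix initialization in an “if( threadIdx.x == 0 && threadIdx.y == 0 )” block, followed by a __syncthreads() so that no thread reads the matrix before it has been written. (A chained comparison like “a == b == 0” does not test that both values are zero in C.)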

I can’t figure out the sense of your transform! What is point? Something like float2? Then you have to multiply a 2x2 matrix with the point. If I’m not completely wrong, I would expect code similar to this:

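__global__
void transform( point *arrD, int width, int height )
{
    //transform matrix (shared memory)
    __shared__ float m[4];
    //one thread per block fills the matrix (identity, for testing)
    if( threadIdx.x == 0 && threadIdx.y == 0 )
    {
        m[0] = 1.0f; m[1] = 0.0f;
        m[2] = 0.0f; m[3] = 1.0f;
    }
    //wait until the matrix is written before any thread reads it
    __syncthreads();
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    if(row<height && col<width)
    {
        //read both components first, because the point is updated in place
        float x = arrD[row*width+col].x;
        float y = arrD[row*width+col].y;
        arrD[row*width+col].x = x*m[0] + y*m[1];
        arrD[row*width+col].y = x*m[2] + y*m[3];
    }
}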

Initializing the matrix by all threads was stupid :) Thank you!

The final transformation matrix will be only 2 by 3… it is an image transform matrix, for example for rotation, scale, skew, …

but this one … static one… is only for testing, it is the identity matrix and the image has to stay unchanged after the multiplication…

And one dimension is useless, because the result of the multiplication for that dimension will always be 1… but that’s not the main point right now.
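
For the 2 by 3 case, the implied third coordinate stays fixed at 1, so the third matrix column simply acts as a translation. A minimal sketch of that, assuming point is laid out like float2 (its definition is not shown in this thread); the helper name applyAffine and the row-major layout of m are my own assumptions, not taken from the original code:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the original code): applies a row-major
// 2x3 affine matrix
//   | m[0] m[1] m[2] |
//   | m[3] m[4] m[5] |
// to the point (x, y, 1). Assumes the point type behaves like float2.
__host__ __device__ inline float2 applyAffine(const float *m, float2 p)
{
    float2 r;
    r.x = m[0] * p.x + m[1] * p.y + m[2];  // m[2] is the x translation
    r.y = m[3] * p.x + m[4] * p.y + m[5];  // m[5] is the y translation
    return r;
}
```

With m = {1, 0, 0, 0, 1, 0} this is the identity, so the image should come out unchanged, as in the test matrix above.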

So, what I will probably have to do is:

Load a tile from the big image array into shared memory, then do the multiplication with the transform matrix, and then copy the result back to global memory - to the big array.

Do you think I’m thinking right?

So I will have to work out which index in the computed tile corresponds to which index in the big array with all image coordinates [x,y].

From the parallel point of view there’s no need to load whole tiles. CUDA’s threading model allows you to spend one thread per pixel. If you want to optimize your code for coalesced memory access, you have to think about tiles. You can simply map a tile to a thread block, using threadIdx within the block together with the global thread ID:

unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
smem[threadIdx.y*blockDim.x + threadIdx.x] = arrD[row*width + col];
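
Put together, the three lines above could sit in a kernel roughly like the following. This is only a sketch under assumptions: point is taken to be float2, the tile size of 16x16 is arbitrary, the kernel name transformTiled is made up, and the 2x2 matrix is passed in through global memory rather than built in the kernel.

```cuda
#include <cuda_runtime.h>

#define TILE_W 16  // assumed block width, must match blockDim.x at launch
#define TILE_H 16  // assumed block height, must match blockDim.y at launch

// Sketch: one thread per pixel. Each block stages its tile in shared memory,
// multiplies every point by the row-major 2x2 matrix m, and writes it back.
__global__ void transformTiled(float2 *arrD, const float *m, int width, int height)
{
    __shared__ float2 smem[TILE_H * TILE_W];

    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int t = threadIdx.y * blockDim.x + threadIdx.x;  // index inside the tile
    bool inside = (row < height && col < width);

    // tile index t corresponds to global index row*width + col
    if (inside)
        smem[t] = arrD[row * width + col];

    // Each thread only touches its own tile slot here, so this barrier is not
    // required for correctness; it marks where one would be needed as soon as
    // threads read elements loaded by their neighbours.
    __syncthreads();

    if (inside)
    {
        float2 p = smem[t];
        float2 r;
        r.x = p.x * m[0] + p.y * m[1];
        r.y = p.x * m[2] + p.y * m[3];
        arrD[row * width + col] = r;
    }
}
```

It would be launched with dim3 block(TILE_W, TILE_H) and a grid of (width + TILE_W - 1) / TILE_W by (height + TILE_H - 1) / TILE_H blocks. Note that in this form the shared-memory staging buys nothing over reading arrD directly, since no data is shared between threads.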

Yes. Your kernel is most certainly memory bandwidth bound having only 2 float reads and 2 float writes in each thread with very little math. There is nothing you can do to make it go faster other than getting fully coalesced loads and stores.
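
One way to check this is to time the kernel and compare the effective bandwidth against the peak figure for your card. A rough sketch, counting 2 float reads plus 2 float writes (16 bytes) per pixel as described above; the helper name effectiveBandwidthGBs is made up for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: effective bandwidth in GB/s of the transform kernel,
// assuming each pixel moves 2 float reads + 2 float writes = 16 bytes.
inline double effectiveBandwidthGBs(int width, int height, float elapsedMs)
{
    double bytes = 16.0 * (double)width * (double)height;
    double seconds = (double)elapsedMs * 1.0e-3;
    return bytes / seconds / 1.0e9;
}

// Usage sketch (error checks omitted), timing with CUDA events:
//   cudaEvent_t start, stop;
//   cudaEventCreate(&start); cudaEventCreate(&stop);
//   cudaEventRecord(start);
//   transform<<<grid, block>>>(arrD, width, height);
//   cudaEventRecord(stop);
//   cudaEventSynchronize(stop);
//   float ms = 0.0f;
//   cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
//   double gbs = effectiveBandwidthGBs(width, height, ms);
```

If the number is already close to what the hardware can deliver, the kernel is bandwidth bound and only better memory access patterns will help.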

Shared memory is only useful when many threads in a block will be sharing the data. Textures are only useful if you have a semi-random memory read pattern that cannot be coalesced.