Shared vs. texture memory

Huh, can’t change the topic name :) No texture memory for my problem…

Anyway…

I’m optimizing this code:

__global__
void transform( point *arrD, int width, int height )
{
	// transform matrix in shared memory (declared inside the kernel,
	// where __shared__ variables belong)
	__shared__ float m[9];

	m[0] = 1.0f;	m[1] = 0.0f;	m[2] = 0.0f;
	m[3] = 0.0f;	m[4] = 1.0f;	m[5] = 0.0f;
	m[6] = 0.0f;	m[7] = 0.0f;	m[8] = 1.0f;

	unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
	unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

	if( row < height && col < width )
	{
		arrD[row*width+col].x = col*m[0] + row*m[1] + m[2];
		arrD[row*width+col].y = col*m[3] + row*m[4] + m[5];
	}
}

This kernel is supposed to fill the array arrD with coordinates transformed by the matrix m.

  • arrD currently resides in global memory, which is slow, while the matrix m is in shared memory (I launch one thread per pixel; a launch sketch follows after this list)

  • arrD holds the image coordinates, so its size is image->width * image->height
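Roughly like this on the host side (the block size is just an example, nothing is fixed yet):

	// example launch: one thread per pixel, 16x16 threads per block
	dim3 block( 16, 16 );
	dim3 grid( (width  + block.x - 1) / block.x,	// enough blocks to cover the image
	           (height + block.y - 1) / block.y );
	transform<<< grid, block >>>( arrD, width, height );
	cudaDeviceSynchronize();	// wait for the kernel to finish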

Now, what’s the best way to optimize this if, let’s say, I need to touch every element of arrD only once?

Can’t figure out how to use the shared memory for this…

Hi!

I’m not an expert on shared memory issues, but I think there is enough room in your code to make things better.

First, shared memory is shared among all threads of a block, so in my opinion initializing the matrix in every thread makes no sense. It would be more sensible to wrap the matrix initialization in an “if( threadIdx.x == 0 && threadIdx.y == 0 )” block, followed by a __syncthreads() so that the other threads see the initialized values before using them.

I can’t figure out the sense of your transform! What is point? Something like float2? Then you have to multiply a 2x2 matrix with the point. If I’m not completely wrong, I would expect code similar to this:

__global__
void transform( point *arrD, int width, int height )
{
	// 2x2 transform matrix in shared memory, initialized by one thread only
	__shared__ float m[4];

	if( threadIdx.x == 0 && threadIdx.y == 0 )
	{
		m[0] = 1.0f;	m[1] = 0.0f;
		m[2] = 0.0f;	m[3] = 1.0f;
	}
	__syncthreads();	// make the matrix visible to all threads of the block

	unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
	unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

	if( row < height && col < width )
	{
		float x = arrD[row*width+col].x;
		float y = arrD[row*width+col].y;

		arrD[row*width+col].x = x*m[0] + y*m[1];
		arrD[row*width+col].y = x*m[2] + y*m[3];
	}
}
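In the snippet above I’m assuming point is a plain two-float struct; something like this (or simply float2) is what I have in mind:

	// assumed layout; not from your code
	struct point
	{
		float x;
		float y;
	};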

Hi,

thanks for looking at it.

Initializing the matrix in every thread was stupid :) Thank you!

The transformation matrix will finally be only 2 by 3… it is an image transform matrix, for example for a rotation, scale, skew, …

but this one… the static one… is only for testing; it is the identity matrix, so the image has to stay unchanged after the multiplication…

And the third row is useless, because the third homogeneous coordinate is always 1 after the multiplication… but that’s not the main point right now.
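For example (just a sketch of what I mean, nothing final), a rotation about the origin would fill the 2x3 matrix on the host like this:

	#include <math.h>

	// 2x3 affine matrix for a rotation by 'angle' radians:
	//   x' = m[0]*x + m[1]*y + m[2]
	//   y' = m[3]*x + m[4]*y + m[5]
	void makeRotation( float m[6], float angle )
	{
		m[0] =  cosf( angle );	m[1] = -sinf( angle );	m[2] = 0.0f;
		m[3] =  sinf( angle );	m[4] =  cosf( angle );	m[5] = 0.0f;
	}

With angle = 0 this degenerates to the identity matrix used in the test kernel above.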

So, what I will probably have to do is: load a tile from the big image array into shared memory, then do the multiplication with the transform matrix, and then copy the result back to global memory, into the big array.

Do you think I’m thinking along the right lines?

  • so I will have to work out which index in the computed tile corresponds to which index in the big array with all the image coordinates [x,y].

That’s hard for me to imagine…

Hi!

From the parallel point of view there’s no need to load whole tiles: CUDA’s threading model lets you spend one thread per pixel. If you want to optimize for coalesced memory access, though, you do have to think in tiles. You can simply map a tile to a thread block, combining blockIdx and threadIdx into the global position:

unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

// smem is a __shared__ array of blockDim.x * blockDim.y elements
smem[threadIdx.y*blockDim.x + threadIdx.x] = arrD[row*width + col];
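Put together, a coalesced load-transform-store round trip through shared memory could look roughly like this (just a sketch; the tile size and the point type are assumptions on my side):

	#define TILE 16		// must match the block dimensions at launch

	__global__ void transformTiled( point *arrD, int width, int height )
	{
		__shared__ point smem[TILE * TILE];

		unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
		unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

		if( row < height && col < width )
			smem[threadIdx.y*blockDim.x + threadIdx.x] = arrD[row*width + col];	// coalesced read

		__syncthreads();

		if( row < height && col < width )
		{
			point p = smem[threadIdx.y*blockDim.x + threadIdx.x];
			// ... transform p here ...
			arrD[row*width + col] = p;	// coalesced write
		}
	}

Note, though, that for a pure per-pixel transform the round trip through shared memory buys nothing by itself; it only pays off when the threads of a block actually need each other’s data.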

Hi,

thanks a lot for answers!

I tried a few things and this code is the fastest.

__global__ void transf( point *arrD, int width, int height )
{
	// 2x3 test matrix in shared memory - image coordinates will stay the same...
	__shared__ float m[6];

	if( threadIdx.x == 0 && threadIdx.y == 0 )
	{
		m[0] = 1.0f;	m[1] = 0.0f;	m[2] = 0.0f;
		m[3] = 0.0f;	m[4] = 1.0f;	m[5] = 0.0f;
	}
	__syncthreads();	// the other threads must wait for the matrix

	// x = cols, y = rows
	unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
	unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

	// store the result to the array in global memory
	if( row < height && col < width )
	{
		arrD[row*width+col].x = col*m[0] + row*m[1] + m[2];
		arrD[row*width+col].y = col*m[3] + row*m[4] + m[5];
	}

	/* COMMENTED SEQUENCE... tried this too, but it was much slower
	if( row < height && col < width )
	{
		// arrZ - SHARED MEMORY
		arrZ[threadIdx.y*blockDim.x + threadIdx.x].x = col*m[0] + row*m[1] + m[2];
		arrZ[threadIdx.y*blockDim.x + threadIdx.x].y = col*m[3] + row*m[4] + m[5];
	}
	__syncthreads();

	// the last thread of each block stores the data to global memory
	// (its flat index is blockDim.x*blockDim.y - 1)
	if( (blockDim.x*blockDim.y - 1) == (threadIdx.y*blockDim.x + threadIdx.x) )
	{
		for( int u = 0; u < blockDim.y; u++ )
		for( int t = 0; t < blockDim.x; t++ )
		{
			if( ((blockIdx.y*blockDim.y+u)*width + (blockIdx.x*blockDim.x+t)) < (width*height) )
			{
				arrD[(blockIdx.y*blockDim.y+u)*width + (blockIdx.x*blockDim.x+t)].x = arrZ[u*blockDim.x+t].x;
				arrD[(blockIdx.y*blockDim.y+u)*width + (blockIdx.x*blockDim.x+t)].y = arrZ[u*blockDim.x+t].y;
			}
		}
	}
	*/
}

Visual Profiler says that 56% of the time is spent in the memset and the rest in the kernel. So this is probably the maximum speed I can get.

Am I right, please? :ph34r:

Yes. Your kernel is most certainly memory-bandwidth bound, with only two float reads and two float writes per thread and very little math. There is nothing you can do to make it go faster other than getting fully coalesced loads and stores.

Shared memory is only useful when many threads in a block share the same data. Textures are only useful if you have a semi-random read pattern that cannot be coalesced.
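If you want to sanity-check that, time the kernel with CUDA events and compute the effective bandwidth (a sketch only; grid and block configured as in a normal one-thread-per-pixel launch):

	cudaEvent_t start, stop;
	cudaEventCreate( &start );
	cudaEventCreate( &stop );

	cudaEventRecord( start, 0 );
	transf<<< grid, block >>>( arrD, width, height );
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );

	float ms = 0.0f;
	cudaEventElapsedTime( &ms, start, stop );	// milliseconds

	// your kernel writes one point (two floats) per pixel
	double bytes = (double)width * height * sizeof( point );
	printf( "effective bandwidth: %.1f GB/s\n", bytes / ( ms * 1.0e6 ) );

If that number is close to your card’s peak memory bandwidth, the kernel is as fast as it will get.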

Great!

Thanks for answers guys :)