Dynamically allocated shared memory

Hi! I’m trying to port the following code to shared memory, but I’m stuck: I don’t know how to work properly with a dynamically allocated shared memory array. Can someone help me out?

Original code (Global Memory only):

__global__ void ConvolutionRowKernel(unsigned char* surfaceOutput, size_t pitchOutput, unsigned char* surfaceInput, size_t pitchInput, int width, int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) return;

	float* pixelOutput = (float*)(surfaceOutput + y*pitchOutput) + 4*x;

	float convolutionResultR = 0.0f;
	float convolutionResultG = 0.0f;
	float convolutionResultB = 0.0f;
	//Sum the 2*KERNELRADIUS+1 taps centred on x
	for (int i = -KERNELRADIUS; i <= KERNELRADIUS; i++)
	{
		float* pixelInput = (float*)(surfaceInput + y*pitchInput) + 4*(x+i);
		convolutionResultR += pixelInput[0];
		convolutionResultG += pixelInput[1];
		convolutionResultB += pixelInput[2];
	}
	pixelOutput[0] = convolutionResultR / (2*KERNELRADIUS+1);
	pixelOutput[1] = convolutionResultG / (2*KERNELRADIUS+1);
	pixelOutput[2] = convolutionResultB / (2*KERNELRADIUS+1);
	pixelOutput[3] = 1.0f;
}

Optimized code (Shared Memory):

__global__ void ConvolutionRowKernel(unsigned char* surfaceOutput, size_t pitchOutput, unsigned char* surfaceInput, size_t pitchInput, int width, int height)
{
	//Shared memory variable declarations
	__shared__ float s_data_R;
	__shared__ float s_data_G;
	__shared__ float s_data_B;
	__shared__ long sI_tmp;
	__shared__ float sI;
	__shared__ float pI;
	extern __shared__ float s_data_Input[];
	__shared__ float convolutionResultR;
	__shared__ float convolutionResultG;
	__shared__ float convolutionResultB;

	const int tid = threadIdx.x;
	const int bid = blockIdx.x;
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	//Read into shared memory
	if ((x < width) && (y < height))
	{
		//Initialize variables that we're going to use
		s_data_R = 0.0f;
		s_data_G = 0.0f;
		s_data_B = 0.0f;
		sI_tmp = 0;
		sI = 0.0f;
		pI = 0.0f;
		convolutionResultR = 0.0f;
		convolutionResultG = 0.0f;
		convolutionResultB = 0.0f;

		for (int i = -KERNELRADIUS; i < KERNELRADIUS; i++)
		{
			//To be able to cast from unsigned char to float we first need to cast to a long
			sI_tmp = (long) surfaceInput;
			sI = (float) sI_tmp;
			pI = (float) pitchInput;

			//Read memory into dynamically allocated shared memory (p22 @ programming guide)
			s_data_Input[tid] = ((sI + y*pI) + 4*(x+i));
			//Set pointer into shared memory
			float* s_data_pixelInput0 = (float*)s_data_Input;
			float* s_data_pixelInput1 = (float*)&s_data_pixelInput0;
			//Calculate the convolutions
			convolutionResultR += s_data_pixelInput1[0];
			convolutionResultG += s_data_pixelInput1[1];
			convolutionResultB += s_data_pixelInput1[2];
			//Synchronize to make sure the data is loaded
			__syncthreads();
		}

		s_data_R = convolutionResultR / (2*KERNELRADIUS+1);
		s_data_G = convolutionResultG / (2*KERNELRADIUS+1);
		s_data_B = convolutionResultB / (2*KERNELRADIUS+1);
		__syncthreads();

		//Write the results to global memory
		float* pixelOutput = (float*)(surfaceOutput + y*pitchOutput) + 4*x;
		pixelOutput[0] = s_data_R;
		pixelOutput[1] = s_data_G;
		pixelOutput[2] = s_data_B;
		pixelOutput[3] = 1.0f;
	}
	else
	{
		return;
	}
}

So the following code isn’t what it should be:

			//Read memory into dynamically allocated shared memory (p22 @ programming guide)
			s_data_Input[tid] = ((sI + y*pI) + 4*(x+i));
			//Set pointer into shared memory
			float* s_data_pixelInput0 = (float*)s_data_Input;
			float* s_data_pixelInput1 = (float*)&s_data_pixelInput0;
			//Calculate the convolutions
			convolutionResultR += s_data_pixelInput1[0];
			convolutionResultG += s_data_pixelInput1[1];
			convolutionResultB += s_data_pixelInput1[2];

Can anyone help me out?

Thanks in advance!

Is KERNELRADIUS a constant or does it change within the algorithm? If it’s constant, I don’t see why you’d need dynamically allocated shared memory.

s_data_Input is shared between threads. That means that when you write element 0 in your code, every thread writes to the same location, so you get a race condition.

I’m trying to listen to a class at the same time, so maybe I’m not reading your code right, but there seem to be many things wrong with it.

Firstly, every thread in the block computes its own ((sI + y*pI) + 4*(x+i)), but they all put it in element [0] of the shared memory array. You don’t __syncthreads() after, so I’m not even sure what’s going to happen there. What you probably want is to store into [threadIdx.x], or, if you’re treating one pixel per loop iteration, then you only want threadIdx.x==0 to LOAD the data into shared memory; I’m not sure which one you’re trying to achieve.

Also, you’re reading s_data_pixelInput1[1] and [2], but as far as I can tell they have never been loaded, and there is definitely no __syncthreads() before the reads.

convolutionResultR,G,B are shared… I’m not sure why. You most likely want each thread (one thread per pixel?) to perform its own convolution? If so, those variables should be private to the thread. Same for s_data_R,G,B.

Hope I didn’t read your problem all wrong!

Well, that’s a typo, it should be [tid] indeed ;) Don’t know how that [0] got in there :/

Well, what I want to achieve is that every pixel is read by one thread, which then performs a convolution on each component of that pixel (R, G, B). The reason convolutionResultR,G,B are shared is that I have to make my program as fast as possible, so I avoid reads from global memory ;)

That’s why I use the dynamically allocated shared memory. If I use the method from the first example, I still read from global memory through this pointer:

float* pixelInput = (float*) (surfaceInput + y*pitchInput) + 4*(x+i);

So I want to avoid that by making use of the dynamically allocated shared memory… Any ideas?

Thanks again!

You need a __syncthreads() after that load in s_data_Input[tid] = ((sI + y*pI) + 4*(x+i));, since after it, all threads of the block will need the elements in shared memory.

convolutionResultR,G,B cannot be shared, as they are … private(!) to a given pixel, and therefore to a thread. If you declare them without the __shared__ qualifier, they will be put into registers, which is just as fast. You just have way too many __shared__ qualifiers in there, and you seem to have missed the fact that there is also a register file. It’s not just global memory or shared memory; it’s global memory, shared memory and registers.
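For illustration, the accumulator declarations would simply lose their qualifiers (just a sketch of that change):

```cuda
// Plain automatic variables: one private copy per thread, kept in
// registers by the compiler -- no __shared__ qualifier.
float convolutionResultR = 0.0f;
float convolutionResultG = 0.0f;
float convolutionResultB = 0.0f;
```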

So what you want to do is load all the pixels of a block into the shared array (before your loop), and then do all the computations for one pixel using registers AND the values you have loaded into shared memory. That way, every pixel is only read once from global memory, which is what you want to achieve. Well, somewhat; more on that later!

Also, since KERNELRADIUS seems to be a #define, you don’t even need dynamically allocated shared memory; you can use statically allocated shared memory.
A simple __shared__ float sharedPixels[KERNELRADIUS] would do. As bigmac has already stated.

Now, another problem with your approach is that you can’t load only KERNELRADIUS values into shared memory: those values are what one pixel needs, but the pixel next to it will also need a column of pixels you haven’t loaded into shared memory. You could load (KERNELRADIUS*2) values into shared memory; that way you could treat KERNELRADIUS pixels using that shared array.
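Putting all of the above together, here is a rough sketch of that layout. This is an illustration, not tested code: BLOCKDIM_X, s_pixels and the clamp-to-border handling are my own made-up names and choices, and it assumes KERNELRADIUS is a #define and the kernel is launched with blockDim.y == 1 (so the early returns are uniform across the block and don’t break __syncthreads()).

```cuda
#define KERNELRADIUS 8   // assumed compile-time constant, as in your code
#define BLOCKDIM_X  64   // hypothetical block width; must match the launch config

__global__ void ConvolutionRowKernel(unsigned char* surfaceOutput, size_t pitchOutput,
                                     unsigned char* surfaceInput, size_t pitchInput,
                                     int width, int height)
{
	// Statically allocated shared memory: one RGBA pixel (4 floats) per thread,
	// plus a halo of KERNELRADIUS pixels on each side of the block.
	__shared__ float s_pixels[4 * (BLOCKDIM_X + 2*KERNELRADIUS)];

	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (y >= height) return;   // uniform, since blockDim.y == 1 is assumed

	const float* rowInput = (const float*)(surfaceInput + y*pitchInput);

	// Cooperative load: the block fills its tile plus both halos, striding by
	// blockDim.x so every slot gets written. Out-of-range columns are clamped
	// to the image border here, for simplicity.
	for (int s = threadIdx.x; s < BLOCKDIM_X + 2*KERNELRADIUS; s += blockDim.x)
	{
		int col = blockIdx.x*blockDim.x + s - KERNELRADIUS;
		col = min(max(col, 0), width - 1);
		s_pixels[4*s+0] = rowInput[4*col+0];
		s_pixels[4*s+1] = rowInput[4*col+1];
		s_pixels[4*s+2] = rowInput[4*col+2];
		s_pixels[4*s+3] = rowInput[4*col+3];
	}
	__syncthreads();   // every thread needs its neighbours' loads to be visible

	if (x >= width) return;

	// Per-thread accumulators: plain locals end up in registers.
	float r = 0.0f, g = 0.0f, b = 0.0f;
	for (int i = -KERNELRADIUS; i <= KERNELRADIUS; i++)
	{
		int s = threadIdx.x + KERNELRADIUS + i;   // index into the shared tile
		r += s_pixels[4*s+0];
		g += s_pixels[4*s+1];
		b += s_pixels[4*s+2];
	}

	float* pixelOutput = (float*)(surfaceOutput + y*pitchOutput) + 4*x;
	pixelOutput[0] = r / (2*KERNELRADIUS+1);
	pixelOutput[1] = g / (2*KERNELRADIUS+1);
	pixelOutput[2] = b / (2*KERNELRADIUS+1);
	pixelOutput[3] = 1.0f;
}
```

With this layout each input pixel is read from global memory once per block instead of up to 2*KERNELRADIUS+1 times, and all the per-pixel arithmetic stays in registers.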

OK, I almost got it!

The only thing I still need to know is how to get the separate R, G, B and A values out of my array.

Now the full pixel is loaded into s_data_Input[tid] doing:

s_data_Input[tid] = ((sI + y*pI) + 4*(x+i));

Knowing that one pixel contains 4 floats representing R, G, B and A, I need pixel[0] for R, pixel[1] for G, pixel[2] for B and finally pixel[3] for A.

Thanks again!
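One way to lay that out (a sketch, assuming one thread per pixel; the launch line and slot layout are illustrative, not taken from your code): give each thread a 4-float slot in the shared array and copy the components through a float pointer, instead of storing the address arithmetic itself as a float.

```cuda
// Dynamic shared array sized for 4 floats (R,G,B,A) per thread at launch:
// ConvolutionRowKernel<<<grid, block, 4 * block.x * sizeof(float)>>>(...);
extern __shared__ float s_data_Input[];

// Keep the global-memory pointer as a pointer; no casting through long/float.
float* pixelInput = (float*)(surfaceInput + y*pitchInput) + 4*x;

// Each thread copies its whole pixel into its own 4-float slot.
s_data_Input[4*tid + 0] = pixelInput[0];   // R
s_data_Input[4*tid + 1] = pixelInput[1];   // G
s_data_Input[4*tid + 2] = pixelInput[2];   // B
s_data_Input[4*tid + 3] = pixelInput[3];   // A
__syncthreads();

// Later, component k of pixel t of the block is simply s_data_Input[4*t + k]:
float r = s_data_Input[4*t + 0];
float g = s_data_Input[4*t + 1];
float b = s_data_Input[4*t + 2];
float a = s_data_Input[4*t + 3];
```

The key point is the indexing convention: with 4 floats per pixel packed as R,G,B,A, component k of pixel t lives at index 4*t + k of the shared array.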