Pixels like RGBA in shared memory array

Hi guys!

I read my pixel into a shared memory array like this:

s_data_Input[threadIdx.x] = (sI + y*pI) + 4*x;

Now I want to calculate a convolution like this:



		//Calculate the convolutions
		convolutionResultR += s_data_Input[threadIdx.x + i];
		convolutionResultG += s_data_Input[threadIdx.x + i];
		convolutionResultB += s_data_Input[threadIdx.x + i];


Seems like it won’t work like that… If I only calculate convolutionResultR and comment out convolutionResultG and convolutionResultB, it gives me a red image, which is logical. Whenever I uncomment convolutionResultG and convolutionResultB, I get a white screen.

I suppose something is wrong with the offsets I use.

Now my question is: how can I get the values of R, G, B and A from my pixel?

I’d like to be able to split my float into 4 parts… I tried to cast from float to float4, but that won’t work… Any ideas?

Thanks in advance!

Your code confuses me. I am guessing you have more than one problem, e.g. is s_data_Input supposed to be pixel data or indices into an array?

In your case I think it may help to get it working correctly without shared memory first.

Perhaps something along the lines of:

#define red(x,y)   data[(y)*pitch + 4*(x)]
#define green(x,y) data[(y)*pitch + 4*(x) + 1]
#define blue(x,y)  data[(y)*pitch + 4*(x) + 2]

int centerx = threadIdx.x + blockIdx.x*blockDim.x;
int centery = threadIdx.y + blockIdx.y*blockDim.y;

//Calculate the convolutions
convolutionResultR += red(centerx+i, centery);
convolutionResultG += green(centerx+i, centery);
convolutionResultB += blue(centerx+i, centery);

Jamie K has the right idea, but it all depends on how your RGBA values are stored. His #define macros will work only if each pixel is stored as four consecutive bytes (i.e. one 32-bit value, e.g. a uint), in which case the first byte is the R value, the second is the G value, and so on. If you’re calculating the data in some other code and uploading it in a different format, you need to adjust your accesses accordingly. If you are doing it that way, you should probably convert to packed values and use some bitmasks/bitshifts to extract whichever channel you want (just #define some little functions for each of R, G, B, A), because you’ll get more data into memory at any one time, which almost always works out faster in CUDA.
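For what it's worth, the bitmask/bitshift helpers mentioned above could look roughly like this. It is only a sketch: it assumes each pixel is packed into one 32-bit unsigned value with R in the lowest byte (so R is the first byte in memory on a little-endian machine), and the names `getR`, `getG`, `getB`, `getA`, `packRGBA` and the `HD` macro are made up for illustration. The `HD` macro just lets the same code compile under nvcc (callable from kernels) or a plain C++ compiler.

```cpp
#include <cstdint>

// Usable from device code under nvcc, plain functions otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed layout (hypothetical): 0xAABBGGRR, i.e. R in the lowest byte.
HD inline uint32_t getR(uint32_t p) { return p & 0xFFu; }
HD inline uint32_t getG(uint32_t p) { return (p >> 8) & 0xFFu; }
HD inline uint32_t getB(uint32_t p) { return (p >> 16) & 0xFFu; }
HD inline uint32_t getA(uint32_t p) { return (p >> 24) & 0xFFu; }

// Pack four 8-bit channel values back into one 32-bit pixel.
HD inline uint32_t packRGBA(uint32_t r, uint32_t g, uint32_t b, uint32_t a) {
    return (r & 0xFFu) | ((g & 0xFFu) << 8) | ((b & 0xFFu) << 16) | ((a & 0xFFu) << 24);
}
```

If your data is packed in a different byte order, only the shift amounts need to change.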

s_data_Input is indeed an array in shared memory with pixel data.

I have a working version without shared memory already, it’s just the implementation in shared memory where I’m stuck…

Well, the point is that I do need offsets into my array, like the 0, 1 and 2 you add in your data array. The thing is: whenever I execute my code I get a white screen… After some adjustments I have the following code for my row convolution filter:

__global__ void ConvolutionRowKernel(unsigned char* surfaceOutput, size_t pitchOutput, unsigned char* surfaceInput, size_t pitchInput, int width, int height)
{
	float s_data_R;
	float s_data_G;
	float s_data_B;

	__shared__ float4 s_data_Input[KERNEL_RADIUS + BLOCK_DIM + KERNEL_RADIUS];

	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	//Read into shared memory
	if((x < width) && (y < height))
	{
		float sI = (float)*surfaceInput;
		float pI = pitchInput;

		//Each thread loads one pixel from global to shared memory
		s_data_Input[threadIdx.x].x = (sI + y*pI) + 4*x; //R
		s_data_Input[threadIdx.x].y = (sI + y*pI) + 4*x+1; //G
		s_data_Input[threadIdx.x].z = (sI + y*pI) + 4*x+2; //B

		//Make sure the loading stage is completed
		__syncthreads();

		float convolutionResultR = 0.0f;
		float convolutionResultG = 0.0f;
		float convolutionResultB = 0.0f;

		for(int i=-KERNEL_RADIUS; i<KERNEL_RADIUS; i++)
		{
			convolutionResultR += s_data_Input[threadIdx.x+i].x; //R
			convolutionResultG += s_data_Input[threadIdx.x+i].y; //G
			convolutionResultB += s_data_Input[threadIdx.x+i].z; //B
		}

		s_data_R = convolutionResultR / KERNEL_W;
		s_data_G = convolutionResultG / KERNEL_W;
		s_data_B = convolutionResultB / KERNEL_W;

		//write the results to global memory
		float* pixelOutput = (float *)(surfaceOutput + y*pitchOutput) + 4*x;
		pixelOutput[0] = s_data_R;
		pixelOutput[1] = s_data_G;
		pixelOutput[2] = s_data_B;
		pixelOutput[3] = 1.0f;
	}
}

I really don’t get what’s wrong…

Thanks again!

Previously, I think your threads were stepping on each other, because red for thread 1 would be stored to the same location as green for thread 0. But that’s fixed now that you’re using float4.

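It looks like your calculation with sI does arithmetic on the pixel value pointed to by surfaceInput, rather than addressing into the image. Make sI a pointer to the start of row y. The row offset has to use the integer pitchInput (in bytes), because a float such as pI cannot be added to a pointer:

float *sI = (float *)(surfaceInput + y*pitchInput);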

And then access like this:

s_data_Input[threadIdx.x].x = sI[4*x]; //R
s_data_Input[threadIdx.x].y = sI[4*x+1]; //G
s_data_Input[threadIdx.x].z = sI[4*x+2]; //B
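To make the addressing explicit, here is a small sketch of the same idea as helper functions. It assumes, as the output code in this thread does, that each pixel is four floats and that the pitch is given in bytes. The names `rowPtr`, `channelAt` and the `HD` macro are hypothetical, not part of CUDA. The key point is that the row offset is added while the pointer is still a byte pointer, and only then is it cast to float.

```cpp
#include <cstddef>

// Usable from device code under nvcc, plain functions otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Pointer to the first float of row y. The pitch is in bytes, so the
// offset is applied to the unsigned char pointer before the cast.
HD inline const float* rowPtr(const unsigned char* surface, size_t pitchBytes, int y) {
    return reinterpret_cast<const float*>(surface + static_cast<size_t>(y) * pitchBytes);
}

// Channel c (0 = R, 1 = G, 2 = B, 3 = A) of pixel (x, y),
// assuming four interleaved floats per pixel.
HD inline float channelAt(const unsigned char* surface, size_t pitchBytes, int x, int y, int c) {
    return rowPtr(surface, pitchBytes, y)[4 * x + c];
}
```

Inside the kernel, `rowPtr(surfaceInput, pitchInput, y)` would play the role of sI.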

Another problem, which is a little more involved, is that with BLOCK_DIM threads only BLOCK_DIM pixels are copied from global memory to shared memory, but you need BLOCK_DIM + 2*KERNEL_RADIUS pixels. So at least some threads will need to load more than one pixel. You could try something like this:

for (int i = threadIdx.x; i < BLOCK_DIM + 2*KERNEL_RADIUS; i += BLOCK_DIM) {
	int gx = blockIdx.x * blockDim.x - KERNEL_RADIUS + i; //may fall outside [0, width) at the image edges
	s_data_Input[i].x = sI[4*gx]; //R
	s_data_Input[i].y = sI[4*gx+1]; //G
	s_data_Input[i].z = sI[4*gx+2]; //B
}

Then for the convolution, use the loop:

for(int i=0; i <= 2*KERNEL_RADIUS; i++) {
	//read s_data_Input[threadIdx.x + i] here; that is 2*KERNEL_RADIUS + 1 taps in total
}

These changes might not get you 100% there, and I may have made some mistakes, but they will get you much closer.
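One way to check the indexing before touching the GPU is to emulate a single block on the host. The sketch below does that for one channel of one row. It is not the kernel itself: the helper names (`clampInt`, `tileSlotToGlobalX`, `boxFilterFromTile`, `emulateBlock`) are invented for illustration, and clamping out-of-range reads to the nearest edge pixel is an assumption, since the thread never says how the borders should be handled. In the real kernel the same logic would run per channel of the float4, with `__syncthreads()` between the load and the filter.

```cpp
#include <vector>

// Usable from device code under nvcc, plain functions otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

HD inline int clampInt(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

// Global x that shared-tile slot `slot` should hold for a block starting at
// blockStartX. Out-of-range positions are clamped to the edge (assumed policy).
HD inline int tileSlotToGlobalX(int blockStartX, int radius, int slot, int width) {
    return clampInt(blockStartX - radius + slot, 0, width - 1);
}

// Box filter for thread t: averages the 2*radius + 1 tile entries
// starting at slot t, which are centred on that thread's pixel.
HD inline float boxFilterFromTile(const float* tile, int t, int radius) {
    float sum = 0.0f;
    for (int i = 0; i <= 2 * radius; ++i) sum += tile[t + i];
    return sum / (2 * radius + 1);
}

// Host-side emulation of one block processing a single-channel row.
inline std::vector<float> emulateBlock(const std::vector<float>& row,
                                       int blockIdxX, int blockDim, int radius) {
    const int width = static_cast<int>(row.size());
    const int blockStartX = blockIdxX * blockDim;
    std::vector<float> tile(blockDim + 2 * radius);

    // Each "thread" t fills slots t, t + blockDim, ... (the strided load).
    for (int t = 0; t < blockDim; ++t)
        for (int slot = t; slot < blockDim + 2 * radius; slot += blockDim)
            tile[slot] = row[tileSlotToGlobalX(blockStartX, radius, slot, width)];

    // On the GPU a __syncthreads() would go here, outside any divergent branch.

    std::vector<float> out;
    for (int t = 0; t < blockDim && blockStartX + t < width; ++t)
        out.push_back(boxFilterFromTile(tile.data(), t, radius));
    return out;
}
```

Comparing `emulateBlock` output against what the kernel writes for the same block is a quick way to spot an off-by-one in either the load or the filter loop.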

First of all: thanks for your help!

Yes indeed, noticed that too ;)

That’s something that I’ve been struggling with… I have to use shared memory as much as possible, but this way I still point to my global memory? I tried to put something like:

float sI = something

but I don’t really know how…

That would’ve been the next step in my optimisation, thanks ;)

Thanks again for your explanation!


  • Flokky