Pixels like RGBA in shared memory array

Flokky · March 4, 2009, 10:44am

Hi guys!

I read my pixel in a shared memory array like this:

s_data_Input[threadIdx.x] = (sI + y*pI) + 4*x;

Now I want to calculate a convolution like this:

for(int i=-KERNEL_RADIUS; i<KERNEL_RADIUS; i++)
{
	//Calculate the convolutions
	convolutionResultR += s_data_Input[threadIdx.x + i];
	convolutionResultG += s_data_Input[threadIdx.x + i];
	convolutionResultB += s_data_Input[threadIdx.x + i];
}

Seems like it won’t work like that… If I calculate only convolutionResultR and comment out convolutionResultG and convolutionResultB, I get a red image, which is logical. As soon as I uncomment convolutionResultG and convolutionResultB, I get a white screen.

I suppose something is wrong with the offsets I use.

Now my question is: how can I get the values of R, G, B and A from my pixel?

I’d like to be able to split my float into 4 parts… I tried to cast from float to float4, but that won’t work… Any ideas?

Thanks in advance!

Jamie_K · March 5, 2009, 6:40pm

Your code confuses me. I am guessing you have more than one problem; for example, is s_data_Input supposed to hold pixel data or indices into an array?

In your case I think it may help to get it working correctly without shared memory first.

Perhaps something along the lines of:

#define red(x,y)   data[(y)*pitch + 4*(x)]
#define green(x,y) data[(y)*pitch + 4*(x) + 1]
#define blue(x,y)  data[(y)*pitch + 4*(x) + 2]

int centerx = threadIdx.x + blockIdx.x*blockDim.x;
int centery = threadIdx.y + blockIdx.y*blockDim.y;

for(int i=-KERNEL_RADIUS; i<KERNEL_RADIUS; i++)
{
	//Calculate the convolutions
	convolutionResultR += red(centerx+i, centery);
	convolutionResultG += green(centerx+i, centery);
	convolutionResultB += blue(centerx+i, centery);
}
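The pitched addressing behind these macros can be tried on the host first. The following is a minimal sketch in plain C++, not the kernel itself. It assumes one byte per channel and four channels per pixel, with pitch given in bytes. The names channel_offset and row_window_sum are hypothetical. Two choices here are assumptions and differ from the loop above: the window runs from -radius to +radius inclusive, and taps that fall outside the row are skipped.

```cpp
#include <cstddef>
#include <vector>

// Byte offset of channel c (0 = R, 1 = G, 2 = B, 3 = A) of pixel (x, y)
// in an image whose rows are `pitch` bytes apart (pitch may include padding).
inline std::size_t channel_offset(int x, int y, std::size_t pitch, int c) {
    return static_cast<std::size_t>(y) * pitch
         + 4u * static_cast<std::size_t>(x)
         + static_cast<std::size_t>(c);
}

// Sum one channel over the horizontal window [cx - radius, cx + radius].
// Assumption: taps outside [0, width) are skipped rather than clamped.
inline int row_window_sum(const std::vector<unsigned char>& data,
                          std::size_t pitch, int width,
                          int cx, int cy, int radius, int c) {
    int sum = 0;
    for (int i = -radius; i <= radius; ++i) {
        const int x = cx + i;
        if (x < 0 || x >= width) continue;  // outside the row
        sum += data[channel_offset(x, cy, pitch, c)];
    }
    return sum;
}
```

Checking a function like this against a few hand-computed pixels makes it easier to tell an indexing mistake from a shared memory mistake later on.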

jack · March 5, 2009, 9:31pm

Jamie K has the idea, but it all depends on how your RGBA values are stored. His #define macros will work only if you store each pixel as a 32-bit value (e.g. uint), in which case the first byte is the R value, the second is the G value, and so on. If you calculate the data in some other code and upload it in a different format, you need to adjust your accesses accordingly.

Also, if you are doing it that way, you should probably convert to the packed values and use some bitmasks/bitshifts to get whatever value you want (just #define some little functions for each of R, G, B, A), because you’ll get more data into memory at any one time, which almost always works out faster in CUDA.
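The bitmask/bitshift idea can be sketched as follows. This is plain C++ and only an illustration. It assumes R sits in the lowest byte of the 32-bit value, followed by G, B and A, which matches the "first byte is R" layout on a little-endian machine. The helper names (pack_rgba, unpack_r and so on) are hypothetical, and inline functions are used here in place of #define macros.

```cpp
#include <cstdint>

// Pack four 8-bit channels into one 32-bit value.
// Assumed layout: bits 0-7 = R, 8-15 = G, 16-23 = B, 24-31 = A.
inline std::uint32_t pack_rgba(std::uint8_t r, std::uint8_t g,
                               std::uint8_t b, std::uint8_t a) {
    return  static_cast<std::uint32_t>(r)
         | (static_cast<std::uint32_t>(g) << 8)
         | (static_cast<std::uint32_t>(b) << 16)
         | (static_cast<std::uint32_t>(a) << 24);
}

// Extract one channel: shift it down to the low byte, then mask.
inline std::uint8_t unpack_r(std::uint32_t p) { return static_cast<std::uint8_t>(p & 0xFFu); }
inline std::uint8_t unpack_g(std::uint32_t p) { return static_cast<std::uint8_t>((p >> 8) & 0xFFu); }
inline std::uint8_t unpack_b(std::uint32_t p) { return static_cast<std::uint8_t>((p >> 16) & 0xFFu); }
inline std::uint8_t unpack_a(std::uint32_t p) { return static_cast<std::uint8_t>((p >> 24) & 0xFFu); }
```

If the data arrives in a different channel order (for example BGRA), only the shift amounts change.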

Flokky · March 9, 2009, 9:42am

s_data_Input is indeed an array in shared memory with pixel data.

I have a working version without shared memory already, it’s just the implementation in shared memory where I’m stuck…

Jamie_K wrote:

perhaps something along the lines of:

#define red(x,y)   data[(y)*pitch + 4*(x)]
#define green(x,y) data[(y)*pitch + 4*(x) + 1]
#define blue(x,y)  data[(y)*pitch + 4*(x) + 2]

int centerx = threadIdx.x + blockIdx.x*blockDim.x;
int centery = threadIdx.y + blockIdx.y*blockDim.y;

for(int i=-KERNEL_RADIUS; i<KERNEL_RADIUS; i++)
{
	//Calculate the convolutions
	convolutionResultR += red(centerx+i, centery);
	convolutionResultG += green(centerx+i, centery);
	convolutionResultB += blue(centerx+i, centery);
}

Well, the point is that I do need offsets into my array, like the 0, 1 and 2 you add when indexing your data array. The thing is, whenever I execute my code I get a white screen… After some adjustments I have the following code for my row convolution filter:

__global__ void ConvolutionRowKernel(unsigned char* surfaceOutput, size_t pitchOutput, unsigned char* surfaceInput, size_t pitchInput, int width, int height)
{
	float s_data_R;
	float s_data_G;
	float s_data_B;

	__shared__ float4 s_data_Input[KERNEL_RADIUS + BLOCK_DIM + KERNEL_RADIUS];
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	//Read into shared memory
	if((x < width) && (y < height))
	{
		float sI = (float)*surfaceInput;
		float pI = pitchInput;
		//Each thread loads one pixel from global to shared memory
		s_data_Input[threadIdx.x].x = (sI + y*pI) + 4*x; //R
		s_data_Input[threadIdx.x].y = (sI + y*pI) + 4*x+1; //G
		s_data_Input[threadIdx.x].z = (sI + y*pI) + 4*x+2; //B
		//Make sure the loading stage is completed
		__syncthreads();
		float convolutionResultR = 0.0f;
		float convolutionResultG = 0.0f;
		float convolutionResultB = 0.0f;
		for(int i=-KERNEL_RADIUS; i<KERNEL_RADIUS; i++)
		{
			convolutionResultR += s_data_Input[threadIdx.x+i].x; //R
			convolutionResultG += s_data_Input[threadIdx.x+i].y; //G
			convolutionResultB += s_data_Input[threadIdx.x+i].z; //B
		}
		s_data_R = convolutionResultR / KERNEL_W;
		s_data_G = convolutionResultG / KERNEL_W;
		s_data_B = convolutionResultB / KERNEL_W;
		//write the results to global memory
		float* pixelOutput = (float *)(surfaceOutput + y*pitchOutput) + 4*x;
		pixelOutput[0] = s_data_R;
		pixelOutput[1] = s_data_G;
		pixelOutput[2] = s_data_B;
		pixelOutput[3] = 1.0f;
	}
}

I really don’t get what’s wrong…

Thanks again!

Jamie_K · March 9, 2009, 1:11pm

Previously I think your threads were stepping on each other, because red for thread 1 would store to the same location as green for thread 0. But that’s fixed now that you’re using float4.

It looks like your calculation with sI does arithmetic on the pixel value pointed to by surfaceInput rather than addressing into the image. Make sI a pointer to the start of row y (the pitch is in bytes, so add it to the unsigned char pointer before casting):

float *sI = (float *)(surfaceInput + y*pitchInput);

And then access like this:

s_data_Input[threadIdx.x].x = sI[4*x]; //R
s_data_Input[threadIdx.x].y = sI[4*x+1]; //G
s_data_Input[threadIdx.x].z = sI[4*x+2]; //B
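The difference between the two forms can be reproduced on the host. This is a minimal sketch in plain C++, assuming four float channels per pixel and a pitch in bytes, as the cast to float * implies. The function names value_arithmetic and pointer_addressing are hypothetical.

```cpp
#include <cstddef>

// What the original load computed: a number built from the VALUE of the
// first byte of the surface plus row/column offsets. No pixel is read.
inline float value_arithmetic(const unsigned char* surface, std::size_t pitch,
                              int x, int y) {
    const float sI = static_cast<float>(*surface);  // value of byte 0, not an address
    const float pI = static_cast<float>(pitch);
    return (sI + y * pI) + 4 * x;
}

// What was intended: step `y` rows forward in bytes, reinterpret the row
// as floats, then read channel c (0 = R, 1 = G, 2 = B, 3 = A) of pixel x.
inline float pointer_addressing(const unsigned char* surface, std::size_t pitch,
                                int x, int y, int c) {
    const float* row = reinterpret_cast<const float*>(
        surface + static_cast<std::size_t>(y) * pitch);
    return row[4 * x + c];
}
```

The first function returns values that grow with x and y regardless of the image content, which would be consistent with a saturated (white) output once all three channels are enabled.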

Another problem, which is a little more involved, is that with BLOCK_DIM threads only BLOCK_DIM pixels are copied from global memory to shared memory, but you need BLOCK_DIM + 2*KERNEL_RADIUS pixels. So at least some threads will need to load more than one pixel. You could try something like this:

for (int i = threadIdx.x; i < BLOCK_DIM + 2*KERNEL_RADIUS; i += BLOCK_DIM) {
	//gx can be < 0 or >= width at the image borders; clamp it or store zeros there
	int gx = blockIdx.x * blockDim.x - KERNEL_RADIUS + i;
	s_data_Input[i].x = sI[4*gx]; //R
	s_data_Input[i].y = sI[4*gx+1]; //G
	s_data_Input[i].z = sI[4*gx+2]; //B
}
__syncthreads();

Then for the convolution, use the loop:

for(int i=0; i <= 2*KERNEL_RADIUS; i++) {
	//tap i reads s_data_Input[threadIdx.x + i]
	...
}

These changes might not get you 100% there, and I may have made some mistakes, but they will get you much closer.
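The tile-plus-halo scheme described above can be checked on the host before moving it into the kernel. The sketch below is plain C++ that emulates one thread block with ordinary loops. It is an illustration under several assumptions: a single channel instead of RGB, zero-fill for pixels outside the row, and a window of 2*radius + 1 taps. The function name block_row_sums is hypothetical.

```cpp
#include <vector>

// Emulates one thread block computing window sums for the pixels
// [block_start, block_start + block_dim) of a single-channel row.
inline std::vector<float> block_row_sums(const std::vector<float>& row,
                                         int block_start, int block_dim,
                                         int radius) {
    const int width = static_cast<int>(row.size());
    const int tile_size = block_dim + 2 * radius;
    std::vector<float> tile(tile_size, 0.0f);  // stands in for shared memory

    // Load phase: "thread" t fills tile[t], tile[t + block_dim], ...
    // so block_dim threads cover all block_dim + 2*radius entries.
    for (int t = 0; t < block_dim; ++t) {
        for (int i = t; i < tile_size; i += block_dim) {
            const int gx = block_start - radius + i;
            // Assumption: pixels outside the row contribute zero.
            tile[i] = (gx >= 0 && gx < width) ? row[gx] : 0.0f;
        }
    }
    // On the GPU, __syncthreads() would separate the two phases here.

    // Convolution phase: thread t's centre pixel is tile[t + radius],
    // so its window is tile[t] .. tile[t + 2*radius].
    std::vector<float> out(block_dim, 0.0f);
    for (int t = 0; t < block_dim; ++t) {
        for (int i = 0; i <= 2 * radius; ++i) {
            out[t] += tile[t + i];
        }
    }
    return out;
}
```

The largest index read is (block_dim - 1) + 2*radius, which is the last valid tile entry, so the convolution loop stays inside the array.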

Flokky · March 9, 2009, 8:53pm

First of all: thanks for your help!

Yes indeed, noticed that too ;)

Jamie_K wrote:

It looks like your calculation with sI will do arithmetic on the pixel value pointed to by surfaceInput, and not do addressing into the image. Make sI a pointer:

float *sI = (float *)(surfaceInput + y*pitchInput);

That’s something that I’ve been struggling with… I have to use shared memory as much as possible, but this way I still point to my global memory? I tried to put something like:

float sI = something

but I don’t really know how…

Jamie_K wrote:

Another problem which is a little more involved is that with BLOCK_DIM threads, only BLOCK_DIM pixels will be copied from global memory to shared memory, but you need BLOCK_DIM + 2*KERNEL_RADIUS pixels. So at least some threads will need to load more than one pixel. You could try something like this:

for (int i = threadIdx.x; i < BLOCK_DIM + 2*KERNEL_RADIUS; i += BLOCK_DIM) {
	int gx = blockIdx.x * blockDim.x - KERNEL_RADIUS + i;
	s_data_Input[i].x = sI[4*gx]; //R
	s_data_Input[i].y = sI[4*gx+1]; //G
	s_data_Input[i].z = sI[4*gx+2]; //B
}
__syncthreads();

That would’ve been the next step in my optimization, thanks ;)

Thanks again for your explanation!

Grtz!

Flokky
