Problems writing ConvolutionResult back to global memory: Optimizing CUDA by coalescing

Flokky · March 24, 2009, 12:12pm

My ConvolutionColumnKernel looks like:

__global__ void ConvolutionColumnKernel(unsigned char* surfaceOutput, size_t pitchOutput, unsigned char* surfaceInput, size_t pitchInput, int width, int height)
{
    __shared__ uchar4 s_data_Input[COLUMN_TILE_W * (KERNEL_RADIUS + COLUMN_TILE_H + KERNEL_RADIUS)];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int smemPos = IMUL(threadIdx.y + KERNEL_RADIUS, COLUMN_TILE_W) + threadIdx.x;

    //Read into shared memory
    if((x < width) && (y < height))
    {
        unsigned char* pixelInput = (surfaceInput + y*pitchInput);

        s_data_Input[threadIdx.x].x = pixelInput[smemPos + IMUL(x, COLUMN_TILE_W)]; //R
        s_data_Input[threadIdx.x].y = pixelInput[smemPos + IMUL(x+1, COLUMN_TILE_W)]; //G
        s_data_Input[threadIdx.x].z = pixelInput[smemPos + IMUL(x+2, COLUMN_TILE_W)]; //B

        __syncthreads();

        uchar3 convolutionResult;

        for(int i=0; i<=KERNEL_W; i++)
        {
            convolutionResult.x += s_data_Input[i].x;
            convolutionResult.y += s_data_Input[i].y;
            convolutionResult.z += s_data_Input[i].z;
        }

        convolutionResult.x /= KERNEL_W;
        convolutionResult.y /= KERNEL_W;
        convolutionResult.z /= KERNEL_W;

        uchar4* pixelOutput;

        pixelOutput[threadIdx.x].x = convolutionResult.x;
        pixelOutput[threadIdx.x].y = convolutionResult.y;
        pixelOutput[threadIdx.x].z = convolutionResult.z;
        pixelOutput[threadIdx.x].w = 1.0;
    }
}

Originally my code looked like this, and it worked fine:

unsigned char* pixelOutput = (unsigned char*) (surfaceOutput + y*pitchOutput) + 4*x;

pixelOutput[0] = convolutionResult.x;
pixelOutput[1] = convolutionResult.y;
pixelOutput[2] = convolutionResult.z;
pixelOutput[3] = 1.0;

Now I tried to coalesce my code like this:

uchar4* pixelOutput;

pixelOutput[threadIdx.x].x = convolutionResult.x;
pixelOutput[threadIdx.x].y = convolutionResult.y;
pixelOutput[threadIdx.x].z = convolutionResult.z;
pixelOutput[threadIdx.x].w = 1.0;

Unfortunately it doesn’t work… Does anyone know what I’m doing wrong? I’m really stuck here… Every time I run it, my PC stalls.

Thanks in advance!

kristleifur · March 24, 2009, 1:15pm

There’s no allocation behind that address. pixelOutput is declared but never assigned, so you’ve got a pointer to random memory and are writing into it.

As far as I can see!

Flokky · March 24, 2009, 1:37pm

I tried something like this, but then it gives me a very strange output…

uchar4* pixelOutput = (uchar4*)(surfaceInput + y*pitchInput);

I just want the R value in pixelOutput[threadIdx.x].x, G value in pixelOutput[threadIdx.x].y, B value in pixelOutput[threadIdx.x].z and the A value in pixelOutput[threadIdx.x].w, but I’m confused by the pointer to global memory…

Would be nice if you could help me out here…

Thanks again!
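A minimal sketch of what the reply above points towards, under stated assumptions: the uchar4 pointer has to be derived from the output surface (surfaceOutput and pitchOutput, not the input ones), the row offset is computed in bytes before the cast, and the pixel is indexed by the global column x rather than threadIdx.x. WriteRowsKernel, CountMismatches, the placeholder pixel values and the alpha of 255 are all hypothetical and only there to make the example checkable; the convolution itself is left out. The snippets in the thread store 1.0 into the alpha byte, which becomes 1 in an unsigned char.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical demo kernel: each thread writes one RGBA pixel with a single
// 4-byte store into a pitched output surface.
__global__ void WriteRowsKernel(unsigned char* surfaceOutput, size_t pitchOutput,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Placeholder for the convolution result (assumption: any per-pixel value).
    uchar3 convolutionResult = make_uchar3(x & 0xFF, y & 0xFF, (x + y) & 0xFF);

    // Start of row y in the OUTPUT surface. The pitch is in bytes, so the row
    // arithmetic is done on the byte pointer and the cast comes afterwards.
    uchar4* rowOutput = (uchar4*)(surfaceOutput + y * pitchOutput);

    // Index by the global column x and store all four channels at once, so
    // neighbouring threads write neighbouring 4-byte words.
    // Alpha of 255 is an assumption (fully opaque).
    rowOutput[x] = make_uchar4(convolutionResult.x, convolutionResult.y,
                               convolutionResult.z, 255);
}

// Hypothetical host helper: runs the kernel and returns how many pixels differ
// from the expected placeholder values, or -1 on a CUDA error.
int CountMismatches(int width, int height)
{
    unsigned char* d_surface = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&d_surface, &pitch,
                        width * sizeof(uchar4), height) != cudaSuccess)
        return -1;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    WriteRowsKernel<<<grid, block>>>(d_surface, pitch, width, height);

    // Copy back row by row, dropping the padding that the pitch adds.
    std::vector<uchar4> h_pixels(width * height);
    cudaError_t err = cudaMemcpy2D(h_pixels.data(), width * sizeof(uchar4),
                                   d_surface, pitch,
                                   width * sizeof(uchar4), height,
                                   cudaMemcpyDeviceToHost);
    cudaFree(d_surface);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            uchar4 p = h_pixels[y * width + x];
            if (p.x != (x & 0xFF) || p.y != (y & 0xFF) ||
                p.z != ((x + y) & 0xFF) || p.w != 255)
                ++mismatches;
        }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", CountMismatches(100, 37));
    return 0;
}
```

The cast to uchar4* relies on every row starting at a 4-byte aligned address, which holds for memory from cudaMallocPitch; a surface allocated some other way would need its pitch and base address checked first.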
