Same code 10 times slower with CUDA 4.0 than with CUDA 3.0

I’ve finally changed from CUDA 3.0 to CUDA 4.0. The same code compiles and works, but it is now 10 times slower than before. What could be the cause?

I use the following compiler flags

CFLAGS='--compiler-bindir /opt/gcc34 -maxrregcount=19 -arch=sm_20 -prec-sqrt=true --ptxas-options=-v -Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64"'

and the compiler gives the following output. I guess that spill stores and spill loads are something bad? (This compiler information was not given with CUDA 3.0.)

ptxas info : Compiling entry function '_Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif' for 'sm_20'
ptxas info : Function properties for _Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif
520 bytes stack frame, 3040 bytes spill stores, 24060 bytes spill loads
ptxas info : Used 19 registers, 16384+0 bytes smem, 192 bytes cmem[0], 11056 bytes cmem[2]

I don’t think you need -maxrregcount=19: you use 16 KB of shared memory per block, so you can have at most three blocks per SM anyway.

If the thread block size is 256, you can use up to 42 registers per thread and still achieve 768 active threads per SM.

Spill loads and stores refer to register spilling to local memory. Assuming the spill statistics are accurate, with this much register spilling I am not surprised that performance tanked. We added a lot more detail to the compiler statistics in 4.0. Local memory usage for spilling was previously subsumed under a general “lmem” statistic. This was no longer appropriate in the presence of an ABI that can also make use of local memory.

Just to confirm a few things: you state you are comparing with 3.0; that is not a typo for 3.2? When you compare the performance between 3.0 and 4.0, does the compilation in both cases use the same commandline invocation you showed above? And is the code running on exactly the same platform in both cases?

I don’t have a ready hypothesis as to what could be going on, and the usual code comparisons I do are between consecutive releases, so I don’t have experience in comparing CUDA 3.0 with CUDA 4.0. Various things have changed since 3.0, for example we have added an ABI which is used by default when compiling for sm_20 targets.

Here are a couple of quick experiments you could try that may give some idea as to what could be going on:

(1) How much do you have to increase the -maxrregcount argument before spilling disappears?
(2) What happens to the spill statistics when you compile with -prec-sqrt=false?
(3) What happens to the spill statistics when you compile with -abi=no?
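For experiments (1) and (2) that just means varying one flag at a time in your existing invocation, something like this (the register-cap values are only example steps):

```
# (1) raise the register cap step by step and watch the spill statistics
nvcc ... -arch=sm_20 -maxrregcount=32 --ptxas-options=-v ...
nvcc ... -arch=sm_20 -maxrregcount=48 --ptxas-options=-v ...

# (2) approximate instead of IEEE-rounded square root
nvcc ... -arch=sm_20 -prec-sqrt=false --ptxas-options=-v ...
```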

Yes, I changed from 3.0 (not 3.2) to 4.0. The only thing I changed in the compiler commandline was "-m32" to "-m64" (with "-m32" I got a lot of "bad register name" errors). Can the "-m64" flag mean that pointers become 64-bit instead of 32-bit, and thereby use two registers instead of one? The kernel uses rather many pointers when it is called:

__global__ void Convolution_2D_Enhancement_Valid_Z(float *Filter_Responses_1, float* Filter_Responses_2, float* Filter_Responses_3, float* Filter_Responses_4, float* Filter_Responses_5, float* Filter_Responses_6, float* Filter_Responses_7, float* Filter_Responses_8, float* Filter_Responses_9, float* Filter_Responses_10, float* Filter_Responses_11, float* Volumes, int z_offset, int t, int DATA_W, int DATA_H, int DATA_D, int DATA_T, int VALID_DATA_W, int VALID_DATA_H, int VALID_DATA_D, int ENHANCEMENT_FILTER_W, int ENHANCEMENT_FILTER_H, int ENHANCEMENT_FILTER_D, int xBlockDifference, int yBlockDifference, int blocksInY, float invBlocksInY)

If I remove maxrregcount I get

ptxas info : Compiling entry function '_Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif' for 'sm_20'
ptxas info : Function properties for _Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif
512 bytes stack frame, 2968 bytes spill stores, 8056 bytes spill loads
ptxas info : Used 63 registers, 16384+0 bytes smem, 192 bytes cmem[0], 11056 bytes cmem[2]

If I remember correctly, the CUDA 3.0 compiler did not report any local memory (lmem) usage for this kernel.

There is no difference if I change to -prec-sqrt=false (the 2D convolution kernel does not use square roots).

nvcc fatal : Unknown option 'abi'

If I compile for architecture 1.3 instead of 2.0 (and remove maxrregcount), I instead get 33 registers and no local memory, but then the kernel uses too much shared memory:

ptxas info : Compiling entry function '_Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif' for 'sm_13'
ptxas info : Used 33 registers, 16544+16 bytes smem, 11056 bytes cmem[0], 16 bytes cmem[1]
ptxas error : Entry function '_Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif' uses too much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)

Sorry about the last item, the flag for turning off the ABI (not recommended other than for experimental purposes!) is -Xptxas -abi=no.

In CUDA, the size of types is the same on the host and the device. So when targeting a 64-bit platform, pointers are 64-bit, and "long" is 64-bit for those 64-bit host platforms where "long" has 64 bits. This applies regardless of the GPU target architecture. To store a 64-bit quantity in registers requires two 32-bit registers.

Since there are no sm_1x devices with > 4 GB of memory, the compiler can apply some optimizations to pointers stored in registers that reduce register usage. Since there are sm_2x devices with > 4 GB of memory, most of these optimizations are no longer applicable to sm_2x targets, so pointers stored in registers pretty much always take up two registers there.

This means that one typically sees a slight increase in register usage when moving code from a 32-bit platform to a 64-bit platform with an sm_1x target, and that there is usually a much more substantial increase in register usage from doing so with an sm_2x target.

Given that you compile for sm_2x on a 64-bit host platform, while previously targeting a 32-bit platform, and the code uses a fair number of pointers based on your previous message, the likely cause of the spills seen is the additional register storage needed for pointers. In addition to pointers already present in the source code itself, the compiler may create additional pointers as induction variables when it optimizes loops that iterate over arrays.

Based on this extremely preliminary analysis, it doesn’t look like a bug is to blame for the growth in register pressure that ultimately leads to spilling. It is unfortunate that there is still spilling even when using the maximum available number of registers (namely 63) on sm_2x. Without seeing the code it is difficult for me to recommend a mitigation strategy. Would it be possible to provide the source code in a self-contained form that allows it to be built here? You could send it attached to a private message if you don’t want to post it publicly. There are no guarantees of course that I’ll be able to come up with a good idea of how to reduce the register pressure. If you could post the code publicly that would have the advantage of being able to leverage the combined experience of the entire CUDA user community here.
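Purely as a hypothetical illustration (untested, and not necessarily a win): one way to reduce the number of live 64-bit pointers would be to pass the eleven output pointers as a single device array of pointers, so that only one base pointer plus small indices is needed:

```
// Hypothetical sketch: Filter_Responses is a device array holding the
// eleven float* output pointers, filled in once from the host.
__global__ void Convolution_2D_Enhancement_Valid_Z(
    float* const* __restrict__ Filter_Responses, /* device array of 11 output pointers */
    float* __restrict__ Volumes,
    int z_offset, int t /* ...remaining parameters unchanged... */)
{
    // ...
    // Filter_Responses[0][idx] += filter_responses.x;
    // Filter_Responses[1][idx] += filter_responses.y;
    // ...
}
```

The trade-off is an extra load to fetch each pointer; whether that beats the spilling would have to be measured.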

It is a kernel for 2D convolution, used as part of a non-separable 4D convolution.

__global__ void Convolution_2D_Enhancement_Valid_Z(float* __restrict__ Filter_Responses_1, float* __restrict__ Filter_Responses_2, float* __restrict__ Filter_Responses_3, float* __restrict__ Filter_Responses_4, float* __restrict__ Filter_Responses_5, float* __restrict__ Filter_Responses_6, float* __restrict__ Filter_Responses_7, float* __restrict__ Filter_Responses_8, float* __restrict__ Filter_Responses_9, float* __restrict__ Filter_Responses_10, float* __restrict__ Filter_Responses_11, float* __restrict__ Volumes, int z_offset, int t, int DATA_W, int DATA_H, int DATA_D, int DATA_T, int VALID_DATA_W, int VALID_DATA_H, int VALID_DATA_D, int ENHANCEMENT_FILTER_W, int ENHANCEMENT_FILTER_H, int ENHANCEMENT_FILTER_D, int xBlockDifference, int yBlockDifference, int blocksInY, float invBlocksInY)

{   

	unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);

	unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz,blocksInY);

	volatile int x = __umul24(blockIdx.x, blockDim.x / 2) * 3 + threadIdx.x;

	volatile int y = __umul24(blockIdxy, blockDim.y) * 3 + threadIdx.y;

	volatile int z = __umul24(blockIdxz, blockDim.z) + threadIdx.z;

	

 	if ( (x >= (DATA_W + xBlockDifference)) || (y >= (DATA_H + yBlockDifference)) || (z < (ENHANCEMENT_FILTER_D - 1)/2)  || (z >= (DATA_D - (ENHANCEMENT_FILTER_D - 1)/2))  || (((z + z_offset) < 0)) || (((z + z_offset) >= DATA_D)) )

		return;

	// Circular convolution in time

	if ((t < 0))

	{	

		t = DATA_T + t;

	}

	else if (t >= DATA_T)

	{

		t = t - DATA_T;

	}

	

	__shared__ float s_Image[64][64];    

	// Blocks

		

	// 1   2  3  4

	// 5   6  7  8

	// 9  10 11 12

	// 13 14 15 16

	s_Image[threadIdx.y][threadIdx.x] = 0.0f;

	s_Image[threadIdx.y + 16][threadIdx.x] = 0.0f;

	s_Image[threadIdx.y + 32][threadIdx.x] = 0.0f;

	s_Image[threadIdx.y + 48][threadIdx.x] = 0.0f;

	s_Image[threadIdx.y][threadIdx.x + 32] = 0.0f;

	s_Image[threadIdx.y + 16][threadIdx.x + 32] = 0.0f;

	s_Image[threadIdx.y + 32][threadIdx.x + 32] = 0.0f;

	s_Image[threadIdx.y + 48][threadIdx.x + 32] = 0.0f;

	// Read data into shared memory

	

	// First row, blocks 1 + 2

	if ( ((x - 8) >= 0) && ((x - 8) < DATA_W) && ((y - 8) >= 0) && ((y - 8) < DATA_H) )

	{

		s_Image[threadIdx.y][threadIdx.x] = Volumes[Calculate_4D_Index(x - 8,y - 8,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	

	// First row, blocks 3 + 4

	if ( ((x + 24) < DATA_W) && ((y - 8) >= 0) && ((y - 8) < DATA_H) )

	{

		s_Image[threadIdx.y][threadIdx.x + 32] = Volumes[Calculate_4D_Index(x + 24,y - 8,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Second row, blocks 5 + 6

	if ( ((x - 8) >= 0) && ((x - 8) < DATA_W) && ((y + 8) < DATA_H) )

	{

		s_Image[threadIdx.y + 16][threadIdx.x] = Volumes[Calculate_4D_Index(x - 8,y + 8,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Second row, blocks 7 + 8

	if ( ((x + 24) < DATA_W) && ((y + 8) < DATA_H) )

	{

		s_Image[threadIdx.y + 16][threadIdx.x + 32] = Volumes[Calculate_4D_Index(x + 24,y + 8,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Third row, blocks 9 + 10

	if ( ((x - 8) >= 0) && ((x - 8) < DATA_W) && ((y + 24) < DATA_H) )

	{

		s_Image[threadIdx.y + 32][threadIdx.x] = Volumes[Calculate_4D_Index(x - 8,y + 24,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Third row, blocks 11 + 12

	if ( ((x + 24) < DATA_W) && ((y + 24) < DATA_H) )

	{

		s_Image[threadIdx.y + 32][threadIdx.x + 32] = Volumes[Calculate_4D_Index(x + 24,y + 24,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Fourth row, blocks 13 + 14

	if ( ((x - 8) >= 0) && ((x - 8) < DATA_W) && ((y + 40) < DATA_H) )

	{

		s_Image[threadIdx.y + 48][threadIdx.x] = Volumes[Calculate_4D_Index(x - 8,y + 40,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	// Fourth row, blocks 15 + 16		

	if ( ((x + 24) < DATA_W) && ((y + 40) < DATA_H) )

	{

		s_Image[threadIdx.y + 48][threadIdx.x + 32] = Volumes[Calculate_4D_Index(x + 24,y + 40,z + z_offset,t,DATA_W, DATA_H, DATA_D)];	

	}

	

	__syncthreads();

	

	// Only threads inside the image do the convolution, calculate filter responses for 48 x 48 pixels

	if ( (x < VALID_DATA_W) && (y < VALID_DATA_H) )

	{

		int idx = Calculate_3D_Index(x,y,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

		float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 8, threadIdx.x + 8);

		Filter_Responses_1[idx] += filter_responses.x;

		Filter_Responses_2[idx] += filter_responses.y;

		Filter_Responses_3[idx] += filter_responses.z;

		Filter_Responses_4[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 8, threadIdx.x + 8);

		Filter_Responses_5[idx] += filter_responses.x;

		Filter_Responses_6[idx] += filter_responses.y;

		Filter_Responses_7[idx] += filter_responses.z;

		Filter_Responses_8[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 8, threadIdx.x + 8);

		Filter_Responses_9[idx] += filter_responses.x;

		Filter_Responses_10[idx] += filter_responses.y;

		Filter_Responses_11[idx] += filter_responses.z;

	}

	if ( (x < VALID_DATA_W) && ((y + 16) < VALID_DATA_H) )

	{

		int idx = Calculate_3D_Index(x,y + 16,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

		float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 24, threadIdx.x + 8);

		Filter_Responses_1[idx] += filter_responses.x;

		Filter_Responses_2[idx] += filter_responses.y;

		Filter_Responses_3[idx] += filter_responses.z;

		Filter_Responses_4[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 24, threadIdx.x + 8);

		Filter_Responses_5[idx] += filter_responses.x;

		Filter_Responses_6[idx] += filter_responses.y;

		Filter_Responses_7[idx] += filter_responses.z;

		Filter_Responses_8[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 24, threadIdx.x + 8);

		Filter_Responses_9[idx] += filter_responses.x;

		Filter_Responses_10[idx] += filter_responses.y;

		Filter_Responses_11[idx] += filter_responses.z;

	}

	if ( (x < VALID_DATA_W) && ((y + 32) < VALID_DATA_H) )

	{

		int idx = Calculate_3D_Index(x,y + 32,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

		float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 40, threadIdx.x + 8);

		Filter_Responses_1[idx] += filter_responses.x;

		Filter_Responses_2[idx] += filter_responses.y;

		Filter_Responses_3[idx] += filter_responses.z;

		Filter_Responses_4[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 40, threadIdx.x + 8);

		Filter_Responses_5[idx] += filter_responses.x;

		Filter_Responses_6[idx] += filter_responses.y;

		Filter_Responses_7[idx] += filter_responses.z;

		Filter_Responses_8[idx] += filter_responses.w;

		filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 40, threadIdx.x + 8);

		Filter_Responses_9[idx] += filter_responses.x;

		Filter_Responses_10[idx] += filter_responses.y;

		Filter_Responses_11[idx] += filter_responses.z;

	}

	if (threadIdx.x < 16)

	{

		if ( ((x + 32) < VALID_DATA_W) && (y < VALID_DATA_H) )

		{

			int idx = Calculate_3D_Index(x + 32,y,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

			float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 8, threadIdx.x + 40);

			Filter_Responses_1[idx] += filter_responses.x;

			Filter_Responses_2[idx] += filter_responses.y;

			Filter_Responses_3[idx] += filter_responses.z;

			Filter_Responses_4[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 8, threadIdx.x + 40);

	

			Filter_Responses_5[idx] += filter_responses.x;

			Filter_Responses_6[idx] += filter_responses.y;

			Filter_Responses_7[idx] += filter_responses.z;

			Filter_Responses_8[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 8, threadIdx.x + 40);

	

			Filter_Responses_9[idx] += filter_responses.x;

			Filter_Responses_10[idx] += filter_responses.y;

			Filter_Responses_11[idx] += filter_responses.z;

		}

		if ( ((x + 32) < VALID_DATA_W) && ((y + 16) < VALID_DATA_H) )

		{

			int idx = Calculate_3D_Index(x + 32,y + 16,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

			float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 24, threadIdx.x + 40);

			Filter_Responses_1[idx] += filter_responses.x;

			Filter_Responses_2[idx] += filter_responses.y;

			Filter_Responses_3[idx] += filter_responses.z;

			Filter_Responses_4[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 24, threadIdx.x + 40);

	

			Filter_Responses_5[idx] += filter_responses.x;

			Filter_Responses_6[idx] += filter_responses.y;

			Filter_Responses_7[idx] += filter_responses.z;

			Filter_Responses_8[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 24, threadIdx.x + 40);

	

			Filter_Responses_9[idx] += filter_responses.x;

			Filter_Responses_10[idx] += filter_responses.y;

			Filter_Responses_11[idx] += filter_responses.z;

		}

		if ( ((x + 32) < VALID_DATA_W) && ((y + 32) < VALID_DATA_H) )

		{

			int idx = Calculate_3D_Index(x + 32,y + 32,z - (ENHANCEMENT_FILTER_D - 1)/2,VALID_DATA_W,VALID_DATA_H);

			float4 filter_responses = Convolve_11x11_Enhancement_4firstfilters(s_Image, threadIdx.y + 40, threadIdx.x + 40);

			Filter_Responses_1[idx] += filter_responses.x;

			Filter_Responses_2[idx] += filter_responses.y;

			Filter_Responses_3[idx] += filter_responses.z;

			Filter_Responses_4[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4middlefilters(s_Image, threadIdx.y + 40, threadIdx.x + 40);

	

			Filter_Responses_5[idx] += filter_responses.x;

			Filter_Responses_6[idx] += filter_responses.y;

			Filter_Responses_7[idx] += filter_responses.z;

			Filter_Responses_8[idx] += filter_responses.w;

			filter_responses = Convolve_11x11_Enhancement_4lastfilters(s_Image, threadIdx.y + 40, threadIdx.x + 40);

	

			Filter_Responses_9[idx] += filter_responses.x;

			Filter_Responses_10[idx] += filter_responses.y;

			Filter_Responses_11[idx] += filter_responses.z;

		}

	}

	

	

}
__device__ float4 Convolve_11x11_Enhancement_4firstfilters(float image[64][64], int y, int x)

{

	float pixel; float4 sums;

	sums.x = 0.0f;

	sums.y = 0.0f;

	sums.z = 0.0f;

	sums.w = 0.0f;

pixel = image[y - 5][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][10];

    sums.y += pixel * c_Enhancement_Filter_2[10][10];

    sums.z += pixel * c_Enhancement_Filter_3[10][10];

    sums.w += pixel * c_Enhancement_Filter_4[10][10];

    pixel = image[y - 4][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][10];

    sums.y += pixel * c_Enhancement_Filter_2[9][10];

    sums.z += pixel * c_Enhancement_Filter_3[9][10];

    sums.w += pixel * c_Enhancement_Filter_4[9][10];

    pixel = image[y - 3][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][10];

    sums.y += pixel * c_Enhancement_Filter_2[8][10];

    sums.z += pixel * c_Enhancement_Filter_3[8][10];

    sums.w += pixel * c_Enhancement_Filter_4[8][10];

    pixel = image[y - 2][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][10];

    sums.y += pixel * c_Enhancement_Filter_2[7][10];

    sums.z += pixel * c_Enhancement_Filter_3[7][10];

    sums.w += pixel * c_Enhancement_Filter_4[7][10];

    pixel = image[y - 1][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][10];

    sums.y += pixel * c_Enhancement_Filter_2[6][10];

    sums.z += pixel * c_Enhancement_Filter_3[6][10];

    sums.w += pixel * c_Enhancement_Filter_4[6][10];

    pixel = image[y + 0][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][10];

    sums.y += pixel * c_Enhancement_Filter_2[5][10];

    sums.z += pixel * c_Enhancement_Filter_3[5][10];

    sums.w += pixel * c_Enhancement_Filter_4[5][10];

    pixel = image[y + 1][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][10];

    sums.y += pixel * c_Enhancement_Filter_2[4][10];

    sums.z += pixel * c_Enhancement_Filter_3[4][10];

    sums.w += pixel * c_Enhancement_Filter_4[4][10];

    pixel = image[y + 2][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][10];

    sums.y += pixel * c_Enhancement_Filter_2[3][10];

    sums.z += pixel * c_Enhancement_Filter_3[3][10];

    sums.w += pixel * c_Enhancement_Filter_4[3][10];

    pixel = image[y + 3][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][10];

    sums.y += pixel * c_Enhancement_Filter_2[2][10];

    sums.z += pixel * c_Enhancement_Filter_3[2][10];

    sums.w += pixel * c_Enhancement_Filter_4[2][10];

    pixel = image[y + 4][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][10];

    sums.y += pixel * c_Enhancement_Filter_2[1][10];

    sums.z += pixel * c_Enhancement_Filter_3[1][10];

    sums.w += pixel * c_Enhancement_Filter_4[1][10];

    pixel = image[y + 5][x - 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][10];

    sums.y += pixel * c_Enhancement_Filter_2[0][10];

    sums.z += pixel * c_Enhancement_Filter_3[0][10];

    sums.w += pixel * c_Enhancement_Filter_4[0][10];

pixel = image[y - 5][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][9];

    sums.y += pixel * c_Enhancement_Filter_2[10][9];

    sums.z += pixel * c_Enhancement_Filter_3[10][9];

    sums.w += pixel * c_Enhancement_Filter_4[10][9];

    pixel = image[y - 4][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][9];

    sums.y += pixel * c_Enhancement_Filter_2[9][9];

    sums.z += pixel * c_Enhancement_Filter_3[9][9];

    sums.w += pixel * c_Enhancement_Filter_4[9][9];

    pixel = image[y - 3][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][9];

    sums.y += pixel * c_Enhancement_Filter_2[8][9];

    sums.z += pixel * c_Enhancement_Filter_3[8][9];

    sums.w += pixel * c_Enhancement_Filter_4[8][9];

    pixel = image[y - 2][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][9];

    sums.y += pixel * c_Enhancement_Filter_2[7][9];

    sums.z += pixel * c_Enhancement_Filter_3[7][9];

    sums.w += pixel * c_Enhancement_Filter_4[7][9];

    pixel = image[y - 1][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][9];

    sums.y += pixel * c_Enhancement_Filter_2[6][9];

    sums.z += pixel * c_Enhancement_Filter_3[6][9];

    sums.w += pixel * c_Enhancement_Filter_4[6][9];

    pixel = image[y + 0][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][9];

    sums.y += pixel * c_Enhancement_Filter_2[5][9];

    sums.z += pixel * c_Enhancement_Filter_3[5][9];

    sums.w += pixel * c_Enhancement_Filter_4[5][9];

    pixel = image[y + 1][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][9];

    sums.y += pixel * c_Enhancement_Filter_2[4][9];

    sums.z += pixel * c_Enhancement_Filter_3[4][9];

    sums.w += pixel * c_Enhancement_Filter_4[4][9];

    pixel = image[y + 2][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][9];

    sums.y += pixel * c_Enhancement_Filter_2[3][9];

    sums.z += pixel * c_Enhancement_Filter_3[3][9];

    sums.w += pixel * c_Enhancement_Filter_4[3][9];

    pixel = image[y + 3][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][9];

    sums.y += pixel * c_Enhancement_Filter_2[2][9];

    sums.z += pixel * c_Enhancement_Filter_3[2][9];

    sums.w += pixel * c_Enhancement_Filter_4[2][9];

    pixel = image[y + 4][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][9];

    sums.y += pixel * c_Enhancement_Filter_2[1][9];

    sums.z += pixel * c_Enhancement_Filter_3[1][9];

    sums.w += pixel * c_Enhancement_Filter_4[1][9];

    pixel = image[y + 5][x - 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][9];

    sums.y += pixel * c_Enhancement_Filter_2[0][9];

    sums.z += pixel * c_Enhancement_Filter_3[0][9];

    sums.w += pixel * c_Enhancement_Filter_4[0][9];

pixel = image[y - 5][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][8];

    sums.y += pixel * c_Enhancement_Filter_2[10][8];

    sums.z += pixel * c_Enhancement_Filter_3[10][8];

    sums.w += pixel * c_Enhancement_Filter_4[10][8];

    pixel = image[y - 4][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][8];

    sums.y += pixel * c_Enhancement_Filter_2[9][8];

    sums.z += pixel * c_Enhancement_Filter_3[9][8];

    sums.w += pixel * c_Enhancement_Filter_4[9][8];

    pixel = image[y - 3][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][8];

    sums.y += pixel * c_Enhancement_Filter_2[8][8];

    sums.z += pixel * c_Enhancement_Filter_3[8][8];

    sums.w += pixel * c_Enhancement_Filter_4[8][8];

    pixel = image[y - 2][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][8];

    sums.y += pixel * c_Enhancement_Filter_2[7][8];

    sums.z += pixel * c_Enhancement_Filter_3[7][8];

    sums.w += pixel * c_Enhancement_Filter_4[7][8];

    pixel = image[y - 1][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][8];

    sums.y += pixel * c_Enhancement_Filter_2[6][8];

    sums.z += pixel * c_Enhancement_Filter_3[6][8];

    sums.w += pixel * c_Enhancement_Filter_4[6][8];

    pixel = image[y + 0][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][8];

    sums.y += pixel * c_Enhancement_Filter_2[5][8];

    sums.z += pixel * c_Enhancement_Filter_3[5][8];

    sums.w += pixel * c_Enhancement_Filter_4[5][8];

    pixel = image[y + 1][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][8];

    sums.y += pixel * c_Enhancement_Filter_2[4][8];

    sums.z += pixel * c_Enhancement_Filter_3[4][8];

    sums.w += pixel * c_Enhancement_Filter_4[4][8];

    pixel = image[y + 2][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][8];

    sums.y += pixel * c_Enhancement_Filter_2[3][8];

    sums.z += pixel * c_Enhancement_Filter_3[3][8];

    sums.w += pixel * c_Enhancement_Filter_4[3][8];

    pixel = image[y + 3][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][8];

    sums.y += pixel * c_Enhancement_Filter_2[2][8];

    sums.z += pixel * c_Enhancement_Filter_3[2][8];

    sums.w += pixel * c_Enhancement_Filter_4[2][8];

    pixel = image[y + 4][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][8];

    sums.y += pixel * c_Enhancement_Filter_2[1][8];

    sums.z += pixel * c_Enhancement_Filter_3[1][8];

    sums.w += pixel * c_Enhancement_Filter_4[1][8];

    pixel = image[y + 5][x - 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][8];

    sums.y += pixel * c_Enhancement_Filter_2[0][8];

    sums.z += pixel * c_Enhancement_Filter_3[0][8];

    sums.w += pixel * c_Enhancement_Filter_4[0][8];

pixel = image[y - 5][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][7];

    sums.y += pixel * c_Enhancement_Filter_2[10][7];

    sums.z += pixel * c_Enhancement_Filter_3[10][7];

    sums.w += pixel * c_Enhancement_Filter_4[10][7];

    pixel = image[y - 4][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][7];

    sums.y += pixel * c_Enhancement_Filter_2[9][7];

    sums.z += pixel * c_Enhancement_Filter_3[9][7];

    sums.w += pixel * c_Enhancement_Filter_4[9][7];

    pixel = image[y - 3][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][7];

    sums.y += pixel * c_Enhancement_Filter_2[8][7];

    sums.z += pixel * c_Enhancement_Filter_3[8][7];

    sums.w += pixel * c_Enhancement_Filter_4[8][7];

    pixel = image[y - 2][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][7];

    sums.y += pixel * c_Enhancement_Filter_2[7][7];

    sums.z += pixel * c_Enhancement_Filter_3[7][7];

    sums.w += pixel * c_Enhancement_Filter_4[7][7];

    pixel = image[y - 1][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][7];

    sums.y += pixel * c_Enhancement_Filter_2[6][7];

    sums.z += pixel * c_Enhancement_Filter_3[6][7];

    sums.w += pixel * c_Enhancement_Filter_4[6][7];

    pixel = image[y + 0][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][7];

    sums.y += pixel * c_Enhancement_Filter_2[5][7];

    sums.z += pixel * c_Enhancement_Filter_3[5][7];

    sums.w += pixel * c_Enhancement_Filter_4[5][7];

    pixel = image[y + 1][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][7];

    sums.y += pixel * c_Enhancement_Filter_2[4][7];

    sums.z += pixel * c_Enhancement_Filter_3[4][7];

    sums.w += pixel * c_Enhancement_Filter_4[4][7];

    pixel = image[y + 2][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][7];

    sums.y += pixel * c_Enhancement_Filter_2[3][7];

    sums.z += pixel * c_Enhancement_Filter_3[3][7];

    sums.w += pixel * c_Enhancement_Filter_4[3][7];

    pixel = image[y + 3][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][7];

    sums.y += pixel * c_Enhancement_Filter_2[2][7];

    sums.z += pixel * c_Enhancement_Filter_3[2][7];

    sums.w += pixel * c_Enhancement_Filter_4[2][7];

    pixel = image[y + 4][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][7];

    sums.y += pixel * c_Enhancement_Filter_2[1][7];

    sums.z += pixel * c_Enhancement_Filter_3[1][7];

    sums.w += pixel * c_Enhancement_Filter_4[1][7];

    pixel = image[y + 5][x - 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][7];

    sums.y += pixel * c_Enhancement_Filter_2[0][7];

    sums.z += pixel * c_Enhancement_Filter_3[0][7];

    sums.w += pixel * c_Enhancement_Filter_4[0][7];

pixel = image[y - 5][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][6];

    sums.y += pixel * c_Enhancement_Filter_2[10][6];

    sums.z += pixel * c_Enhancement_Filter_3[10][6];

    sums.w += pixel * c_Enhancement_Filter_4[10][6];

    pixel = image[y - 4][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][6];

    sums.y += pixel * c_Enhancement_Filter_2[9][6];

    sums.z += pixel * c_Enhancement_Filter_3[9][6];

    sums.w += pixel * c_Enhancement_Filter_4[9][6];

    pixel = image[y - 3][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][6];

    sums.y += pixel * c_Enhancement_Filter_2[8][6];

    sums.z += pixel * c_Enhancement_Filter_3[8][6];

    sums.w += pixel * c_Enhancement_Filter_4[8][6];

    pixel = image[y - 2][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][6];

    sums.y += pixel * c_Enhancement_Filter_2[7][6];

    sums.z += pixel * c_Enhancement_Filter_3[7][6];

    sums.w += pixel * c_Enhancement_Filter_4[7][6];

    pixel = image[y - 1][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][6];

    sums.y += pixel * c_Enhancement_Filter_2[6][6];

    sums.z += pixel * c_Enhancement_Filter_3[6][6];

    sums.w += pixel * c_Enhancement_Filter_4[6][6];

    pixel = image[y + 0][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][6];

    sums.y += pixel * c_Enhancement_Filter_2[5][6];

    sums.z += pixel * c_Enhancement_Filter_3[5][6];

    sums.w += pixel * c_Enhancement_Filter_4[5][6];

    pixel = image[y + 1][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][6];

    sums.y += pixel * c_Enhancement_Filter_2[4][6];

    sums.z += pixel * c_Enhancement_Filter_3[4][6];

    sums.w += pixel * c_Enhancement_Filter_4[4][6];

    pixel = image[y + 2][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][6];

    sums.y += pixel * c_Enhancement_Filter_2[3][6];

    sums.z += pixel * c_Enhancement_Filter_3[3][6];

    sums.w += pixel * c_Enhancement_Filter_4[3][6];

    pixel = image[y + 3][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][6];

    sums.y += pixel * c_Enhancement_Filter_2[2][6];

    sums.z += pixel * c_Enhancement_Filter_3[2][6];

    sums.w += pixel * c_Enhancement_Filter_4[2][6];

    pixel = image[y + 4][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][6];

    sums.y += pixel * c_Enhancement_Filter_2[1][6];

    sums.z += pixel * c_Enhancement_Filter_3[1][6];

    sums.w += pixel * c_Enhancement_Filter_4[1][6];

    pixel = image[y + 5][x - 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][6];

    sums.y += pixel * c_Enhancement_Filter_2[0][6];

    sums.z += pixel * c_Enhancement_Filter_3[0][6];

    sums.w += pixel * c_Enhancement_Filter_4[0][6];

pixel = image[y - 5][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[10][5];

    sums.y += pixel * c_Enhancement_Filter_2[10][5];

    sums.z += pixel * c_Enhancement_Filter_3[10][5];

    sums.w += pixel * c_Enhancement_Filter_4[10][5];

    pixel = image[y - 4][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][5];

    sums.y += pixel * c_Enhancement_Filter_2[9][5];

    sums.z += pixel * c_Enhancement_Filter_3[9][5];

    sums.w += pixel * c_Enhancement_Filter_4[9][5];

    pixel = image[y - 3][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][5];

    sums.y += pixel * c_Enhancement_Filter_2[8][5];

    sums.z += pixel * c_Enhancement_Filter_3[8][5];

    sums.w += pixel * c_Enhancement_Filter_4[8][5];

    pixel = image[y - 2][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][5];

    sums.y += pixel * c_Enhancement_Filter_2[7][5];

    sums.z += pixel * c_Enhancement_Filter_3[7][5];

    sums.w += pixel * c_Enhancement_Filter_4[7][5];

    pixel = image[y - 1][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][5];

    sums.y += pixel * c_Enhancement_Filter_2[6][5];

    sums.z += pixel * c_Enhancement_Filter_3[6][5];

    sums.w += pixel * c_Enhancement_Filter_4[6][5];

    pixel = image[y + 0][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][5];

    sums.y += pixel * c_Enhancement_Filter_2[5][5];

    sums.z += pixel * c_Enhancement_Filter_3[5][5];

    sums.w += pixel * c_Enhancement_Filter_4[5][5];

    pixel = image[y + 1][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][5];

    sums.y += pixel * c_Enhancement_Filter_2[4][5];

    sums.z += pixel * c_Enhancement_Filter_3[4][5];

    sums.w += pixel * c_Enhancement_Filter_4[4][5];

    pixel = image[y + 2][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][5];

    sums.y += pixel * c_Enhancement_Filter_2[3][5];

    sums.z += pixel * c_Enhancement_Filter_3[3][5];

    sums.w += pixel * c_Enhancement_Filter_4[3][5];

    pixel = image[y + 3][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][5];

    sums.y += pixel * c_Enhancement_Filter_2[2][5];

    sums.z += pixel * c_Enhancement_Filter_3[2][5];

    sums.w += pixel * c_Enhancement_Filter_4[2][5];

    pixel = image[y + 4][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][5];

    sums.y += pixel * c_Enhancement_Filter_2[1][5];

    sums.z += pixel * c_Enhancement_Filter_3[1][5];

    sums.w += pixel * c_Enhancement_Filter_4[1][5];

    pixel = image[y + 5][x + 0]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][5];

    sums.y += pixel * c_Enhancement_Filter_2[0][5];

    sums.z += pixel * c_Enhancement_Filter_3[0][5];

    sums.w += pixel * c_Enhancement_Filter_4[0][5];

    pixel = image[y - 5][x + 1];

    sums.x += pixel * c_Enhancement_Filter_1[10][4];

    sums.y += pixel * c_Enhancement_Filter_2[10][4];

    sums.z += pixel * c_Enhancement_Filter_3[10][4];

    sums.w += pixel * c_Enhancement_Filter_4[10][4];

    pixel = image[y - 4][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][4];

    sums.y += pixel * c_Enhancement_Filter_2[9][4];

    sums.z += pixel * c_Enhancement_Filter_3[9][4];

    sums.w += pixel * c_Enhancement_Filter_4[9][4];

    pixel = image[y - 3][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][4];

    sums.y += pixel * c_Enhancement_Filter_2[8][4];

    sums.z += pixel * c_Enhancement_Filter_3[8][4];

    sums.w += pixel * c_Enhancement_Filter_4[8][4];

    pixel = image[y - 2][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][4];

    sums.y += pixel * c_Enhancement_Filter_2[7][4];

    sums.z += pixel * c_Enhancement_Filter_3[7][4];

    sums.w += pixel * c_Enhancement_Filter_4[7][4];

    pixel = image[y - 1][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][4];

    sums.y += pixel * c_Enhancement_Filter_2[6][4];

    sums.z += pixel * c_Enhancement_Filter_3[6][4];

    sums.w += pixel * c_Enhancement_Filter_4[6][4];

    pixel = image[y + 0][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][4];

    sums.y += pixel * c_Enhancement_Filter_2[5][4];

    sums.z += pixel * c_Enhancement_Filter_3[5][4];

    sums.w += pixel * c_Enhancement_Filter_4[5][4];

    pixel = image[y + 1][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][4];

    sums.y += pixel * c_Enhancement_Filter_2[4][4];

    sums.z += pixel * c_Enhancement_Filter_3[4][4];

    sums.w += pixel * c_Enhancement_Filter_4[4][4];

    pixel = image[y + 2][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][4];

    sums.y += pixel * c_Enhancement_Filter_2[3][4];

    sums.z += pixel * c_Enhancement_Filter_3[3][4];

    sums.w += pixel * c_Enhancement_Filter_4[3][4];

    pixel = image[y + 3][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][4];

    sums.y += pixel * c_Enhancement_Filter_2[2][4];

    sums.z += pixel * c_Enhancement_Filter_3[2][4];

    sums.w += pixel * c_Enhancement_Filter_4[2][4];

    pixel = image[y + 4][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][4];

    sums.y += pixel * c_Enhancement_Filter_2[1][4];

    sums.z += pixel * c_Enhancement_Filter_3[1][4];

    sums.w += pixel * c_Enhancement_Filter_4[1][4];

    pixel = image[y + 5][x + 1]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][4];

    sums.y += pixel * c_Enhancement_Filter_2[0][4];

    sums.z += pixel * c_Enhancement_Filter_3[0][4];

    sums.w += pixel * c_Enhancement_Filter_4[0][4];

    pixel = image[y - 5][x + 2];

    sums.x += pixel * c_Enhancement_Filter_1[10][3];

    sums.y += pixel * c_Enhancement_Filter_2[10][3];

    sums.z += pixel * c_Enhancement_Filter_3[10][3];

    sums.w += pixel * c_Enhancement_Filter_4[10][3];

    pixel = image[y - 4][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][3];

    sums.y += pixel * c_Enhancement_Filter_2[9][3];

    sums.z += pixel * c_Enhancement_Filter_3[9][3];

    sums.w += pixel * c_Enhancement_Filter_4[9][3];

    pixel = image[y - 3][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][3];

    sums.y += pixel * c_Enhancement_Filter_2[8][3];

    sums.z += pixel * c_Enhancement_Filter_3[8][3];

    sums.w += pixel * c_Enhancement_Filter_4[8][3];

    pixel = image[y - 2][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][3];

    sums.y += pixel * c_Enhancement_Filter_2[7][3];

    sums.z += pixel * c_Enhancement_Filter_3[7][3];

    sums.w += pixel * c_Enhancement_Filter_4[7][3];

    pixel = image[y - 1][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][3];

    sums.y += pixel * c_Enhancement_Filter_2[6][3];

    sums.z += pixel * c_Enhancement_Filter_3[6][3];

    sums.w += pixel * c_Enhancement_Filter_4[6][3];

    pixel = image[y + 0][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][3];

    sums.y += pixel * c_Enhancement_Filter_2[5][3];

    sums.z += pixel * c_Enhancement_Filter_3[5][3];

    sums.w += pixel * c_Enhancement_Filter_4[5][3];

    pixel = image[y + 1][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][3];

    sums.y += pixel * c_Enhancement_Filter_2[4][3];

    sums.z += pixel * c_Enhancement_Filter_3[4][3];

    sums.w += pixel * c_Enhancement_Filter_4[4][3];

    pixel = image[y + 2][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][3];

    sums.y += pixel * c_Enhancement_Filter_2[3][3];

    sums.z += pixel * c_Enhancement_Filter_3[3][3];

    sums.w += pixel * c_Enhancement_Filter_4[3][3];

    pixel = image[y + 3][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][3];

    sums.y += pixel * c_Enhancement_Filter_2[2][3];

    sums.z += pixel * c_Enhancement_Filter_3[2][3];

    sums.w += pixel * c_Enhancement_Filter_4[2][3];

    pixel = image[y + 4][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][3];

    sums.y += pixel * c_Enhancement_Filter_2[1][3];

    sums.z += pixel * c_Enhancement_Filter_3[1][3];

    sums.w += pixel * c_Enhancement_Filter_4[1][3];

    pixel = image[y + 5][x + 2]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][3];

    sums.y += pixel * c_Enhancement_Filter_2[0][3];

    sums.z += pixel * c_Enhancement_Filter_3[0][3];

    sums.w += pixel * c_Enhancement_Filter_4[0][3];

    pixel = image[y - 5][x + 3];

    sums.x += pixel * c_Enhancement_Filter_1[10][2];

    sums.y += pixel * c_Enhancement_Filter_2[10][2];

    sums.z += pixel * c_Enhancement_Filter_3[10][2];

    sums.w += pixel * c_Enhancement_Filter_4[10][2];

    pixel = image[y - 4][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][2];

    sums.y += pixel * c_Enhancement_Filter_2[9][2];

    sums.z += pixel * c_Enhancement_Filter_3[9][2];

    sums.w += pixel * c_Enhancement_Filter_4[9][2];

    pixel = image[y - 3][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][2];

    sums.y += pixel * c_Enhancement_Filter_2[8][2];

    sums.z += pixel * c_Enhancement_Filter_3[8][2];

    sums.w += pixel * c_Enhancement_Filter_4[8][2];

    pixel = image[y - 2][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][2];

    sums.y += pixel * c_Enhancement_Filter_2[7][2];

    sums.z += pixel * c_Enhancement_Filter_3[7][2];

    sums.w += pixel * c_Enhancement_Filter_4[7][2];

    pixel = image[y - 1][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][2];

    sums.y += pixel * c_Enhancement_Filter_2[6][2];

    sums.z += pixel * c_Enhancement_Filter_3[6][2];

    sums.w += pixel * c_Enhancement_Filter_4[6][2];

    pixel = image[y + 0][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][2];

    sums.y += pixel * c_Enhancement_Filter_2[5][2];

    sums.z += pixel * c_Enhancement_Filter_3[5][2];

    sums.w += pixel * c_Enhancement_Filter_4[5][2];

    pixel = image[y + 1][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][2];

    sums.y += pixel * c_Enhancement_Filter_2[4][2];

    sums.z += pixel * c_Enhancement_Filter_3[4][2];

    sums.w += pixel * c_Enhancement_Filter_4[4][2];

    pixel = image[y + 2][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][2];

    sums.y += pixel * c_Enhancement_Filter_2[3][2];

    sums.z += pixel * c_Enhancement_Filter_3[3][2];

    sums.w += pixel * c_Enhancement_Filter_4[3][2];

    pixel = image[y + 3][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][2];

    sums.y += pixel * c_Enhancement_Filter_2[2][2];

    sums.z += pixel * c_Enhancement_Filter_3[2][2];

    sums.w += pixel * c_Enhancement_Filter_4[2][2];

    pixel = image[y + 4][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][2];

    sums.y += pixel * c_Enhancement_Filter_2[1][2];

    sums.z += pixel * c_Enhancement_Filter_3[1][2];

    sums.w += pixel * c_Enhancement_Filter_4[1][2];

    pixel = image[y + 5][x + 3]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][2];

    sums.y += pixel * c_Enhancement_Filter_2[0][2];

    sums.z += pixel * c_Enhancement_Filter_3[0][2];

    sums.w += pixel * c_Enhancement_Filter_4[0][2];

    pixel = image[y - 5][x + 4];

    sums.x += pixel * c_Enhancement_Filter_1[10][1];

    sums.y += pixel * c_Enhancement_Filter_2[10][1];

    sums.z += pixel * c_Enhancement_Filter_3[10][1];

    sums.w += pixel * c_Enhancement_Filter_4[10][1];

    pixel = image[y - 4][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][1];

    sums.y += pixel * c_Enhancement_Filter_2[9][1];

    sums.z += pixel * c_Enhancement_Filter_3[9][1];

    sums.w += pixel * c_Enhancement_Filter_4[9][1];

    pixel = image[y - 3][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][1];

    sums.y += pixel * c_Enhancement_Filter_2[8][1];

    sums.z += pixel * c_Enhancement_Filter_3[8][1];

    sums.w += pixel * c_Enhancement_Filter_4[8][1];

    pixel = image[y - 2][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][1];

    sums.y += pixel * c_Enhancement_Filter_2[7][1];

    sums.z += pixel * c_Enhancement_Filter_3[7][1];

    sums.w += pixel * c_Enhancement_Filter_4[7][1];

    pixel = image[y - 1][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][1];

    sums.y += pixel * c_Enhancement_Filter_2[6][1];

    sums.z += pixel * c_Enhancement_Filter_3[6][1];

    sums.w += pixel * c_Enhancement_Filter_4[6][1];

    pixel = image[y + 0][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][1];

    sums.y += pixel * c_Enhancement_Filter_2[5][1];

    sums.z += pixel * c_Enhancement_Filter_3[5][1];

    sums.w += pixel * c_Enhancement_Filter_4[5][1];

    pixel = image[y + 1][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][1];

    sums.y += pixel * c_Enhancement_Filter_2[4][1];

    sums.z += pixel * c_Enhancement_Filter_3[4][1];

    sums.w += pixel * c_Enhancement_Filter_4[4][1];

    pixel = image[y + 2][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][1];

    sums.y += pixel * c_Enhancement_Filter_2[3][1];

    sums.z += pixel * c_Enhancement_Filter_3[3][1];

    sums.w += pixel * c_Enhancement_Filter_4[3][1];

    pixel = image[y + 3][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][1];

    sums.y += pixel * c_Enhancement_Filter_2[2][1];

    sums.z += pixel * c_Enhancement_Filter_3[2][1];

    sums.w += pixel * c_Enhancement_Filter_4[2][1];

    pixel = image[y + 4][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][1];

    sums.y += pixel * c_Enhancement_Filter_2[1][1];

    sums.z += pixel * c_Enhancement_Filter_3[1][1];

    sums.w += pixel * c_Enhancement_Filter_4[1][1];

    pixel = image[y + 5][x + 4]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][1];

    sums.y += pixel * c_Enhancement_Filter_2[0][1];

    sums.z += pixel * c_Enhancement_Filter_3[0][1];

    sums.w += pixel * c_Enhancement_Filter_4[0][1];

    pixel = image[y - 5][x + 5];

    sums.x += pixel * c_Enhancement_Filter_1[10][0];

    sums.y += pixel * c_Enhancement_Filter_2[10][0];

    sums.z += pixel * c_Enhancement_Filter_3[10][0];

    sums.w += pixel * c_Enhancement_Filter_4[10][0];

    pixel = image[y - 4][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[9][0];

    sums.y += pixel * c_Enhancement_Filter_2[9][0];

    sums.z += pixel * c_Enhancement_Filter_3[9][0];

    sums.w += pixel * c_Enhancement_Filter_4[9][0];

    pixel = image[y - 3][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[8][0];

    sums.y += pixel * c_Enhancement_Filter_2[8][0];

    sums.z += pixel * c_Enhancement_Filter_3[8][0];

    sums.w += pixel * c_Enhancement_Filter_4[8][0];

    pixel = image[y - 2][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[7][0];

    sums.y += pixel * c_Enhancement_Filter_2[7][0];

    sums.z += pixel * c_Enhancement_Filter_3[7][0];

    sums.w += pixel * c_Enhancement_Filter_4[7][0];

    pixel = image[y - 1][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[6][0];

    sums.y += pixel * c_Enhancement_Filter_2[6][0];

    sums.z += pixel * c_Enhancement_Filter_3[6][0];

    sums.w += pixel * c_Enhancement_Filter_4[6][0];

    pixel = image[y + 0][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[5][0];

    sums.y += pixel * c_Enhancement_Filter_2[5][0];

    sums.z += pixel * c_Enhancement_Filter_3[5][0];

    sums.w += pixel * c_Enhancement_Filter_4[5][0];

    pixel = image[y + 1][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[4][0];

    sums.y += pixel * c_Enhancement_Filter_2[4][0];

    sums.z += pixel * c_Enhancement_Filter_3[4][0];

    sums.w += pixel * c_Enhancement_Filter_4[4][0];

    pixel = image[y + 2][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[3][0];

    sums.y += pixel * c_Enhancement_Filter_2[3][0];

    sums.z += pixel * c_Enhancement_Filter_3[3][0];

    sums.w += pixel * c_Enhancement_Filter_4[3][0];

    pixel = image[y + 3][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[2][0];

    sums.y += pixel * c_Enhancement_Filter_2[2][0];

    sums.z += pixel * c_Enhancement_Filter_3[2][0];

    sums.w += pixel * c_Enhancement_Filter_4[2][0];

    pixel = image[y + 4][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[1][0];

    sums.y += pixel * c_Enhancement_Filter_2[1][0];

    sums.z += pixel * c_Enhancement_Filter_3[1][0];

    sums.w += pixel * c_Enhancement_Filter_4[1][0];

    pixel = image[y + 5][x + 5]; 

    sums.x += pixel * c_Enhancement_Filter_1[0][0];

    sums.y += pixel * c_Enhancement_Filter_2[0][0];

    sums.z += pixel * c_Enhancement_Filter_3[0][0];

    sums.w += pixel * c_Enhancement_Filter_4[0][0];

	return sums;

}

Thanks, I’ll try to build this as soon as I get in front of a machine with a CUDA toolchain. To first order, this looks similar to other code for which I have seen excessive register pressure on sm_20, in that it contains lengthy sums of expressions over array elements.

In those cases the entire computation was a single flat kernel, and I worked around the spill issues by breaking up the kernel into multiple kernels, i.e. I used kernel splitting.
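To illustrate what I mean by kernel splitting, here is a minimal sketch (names and bodies invented, not your actual code): instead of one monolithic kernel keeping all partial results live at once, each kernel computes a piece and stores it to global memory, and a later kernel combines the pieces. Fewer values are live simultaneously in each kernel, so register pressure drops at the cost of extra global memory traffic.

```cuda
// Hypothetical sketch of kernel splitting. The 0.5f factors stand in
// for the two halves of the real accumulation.
__global__ void convolve_pass1(const float *image, float *partial, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        partial[i] = image[i] * 0.5f;          // first half of the sum
}

__global__ void convolve_pass2(const float *image, const float *partial,
                               float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = partial[i] + image[i] * 0.5f;  // second half, combined
}
```

Whether the extra global memory round trip is cheaper than the spilling it avoids has to be measured case by case.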

There seems to be a bunch of code missing; I can’t actually build it, see the compiler errors below. If you want, you could put all the required source files into a .zip file and attach that to a forum post (this is tidier than posting large chunks of code inline, which makes the thread very long).

Here is one observation that may help: all pointers to the kernel are passed with the restrict modifier. In general this is a good thing, as it allows the compiler to optimize more aggressively, for example by extracting more common subexpressions. The flip side is that the additional subexpressions extracted tend to burn up additional registers, especially if their lifetimes span most of the code. In situations where register usage is already high due to numerous pointers, this can easily be counterproductive. One thing to try is to remove restrict and see whether that brings register pressure down to the point where spilling is largely eliminated while staying within 63 registers.
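The experiment I am suggesting looks roughly like this (trivial stand-in kernels, not your code): compile both variants with --ptxas-options=-v and compare the register and spill statistics.

```cuda
// With __restrict__ the compiler may keep more common subexpressions
// live across the whole kernel, raising register pressure.
__global__ void filter_restrict(const float * __restrict__ in,
                                float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Without __restrict__, fewer subexpressions are hoisted; register
// pressure may drop enough to avoid spilling.
__global__ void filter_norestrict(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}
```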

The large functions like Convolve_11x11_Enhancement_4firstfilters() probably all get inlined right now, as the threshold for the inliner is set very high. One interesting experiment would be to see what happens if inlining of these large functions is prevented by use of the noinline attribute. I have insufficient experience with using it, so I cannot predict what will happen.
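For concreteness, the experiment would amount to something like the following (the kernel body is elided; the helper signature matches the mangled name in your ptxas output, but treat this as a sketch):

```cuda
// Forcing the large __device__ helper to remain a real function call
// (this requires the sm_20 ABI) instead of being inlined into the
// kernel body.
__noinline__ __device__ float4
Convolve_11x11_Enhancement_4firstfilters(float image[][64], int x, int y);

__global__ void Convolution_2D_Enhancement_Valid_Z(/* parameters as before */)
{
    __shared__ float s_image[64][64];
    // ... load the tile, __syncthreads(), then e.g.:
    // float4 sums = Convolve_11x11_Enhancement_4firstfilters(s_image, x, y);
}
```

A called function manages its own registers across the call boundary, so the caller no longer has to keep the helper’s entire working set live at once.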

C:[…]/apps/filter/filter.cu(10): error: identifier “c_Enhancement_Filter_1” is undefined

C:[…]/apps/filter/filter.cu(11): error: identifier “c_Enhancement_Filter_2” is undefined

C:[…]/apps/filter/filter.cu(12): error: identifier “c_Enhancement_Filter_3” is undefined

C:[…]/apps/filter/filter.cu(13): error: identifier “c_Enhancement_Filter_4” is undefined

C:[…]/apps/filter/filter.cu(184): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(188): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(192): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(196): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(200): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(204): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(208): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(212): error: identifier “Calculate_4D_Index” is undefined

C:[…]/apps/filter/filter.cu(217): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(223): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(228): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

C:[…]/apps/filter/filter.cu(234): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(240): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(245): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

C:[…]/apps/filter/filter.cu(251): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(257): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(262): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

C:[…]/apps/filter/filter.cu(269): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(275): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(280): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

C:[…]/apps/filter/filter.cu(286): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(292): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(297): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

C:[…]/apps/filter/filter.cu(303): error: identifier “Calculate_3D_Index” is undefined

C:[…]/apps/filter/filter.cu(309): error: identifier “Convolve_11x11_Enhancement_4middlefilters” is undefined

C:[…]/apps/filter/filter.cu(314): error: identifier “Convolve_11x11_Enhancement_4lastfilters” is undefined

__device__ __constant__ float c_Enhancement_Filter_1[11][11];

__device__ __constant__ float c_Enhancement_Filter_2[11][11];

__device__ __constant__ float c_Enhancement_Filter_3[11][11];

__device__ __constant__ float c_Enhancement_Filter_4[11][11];

inline __device__ int Calculate_3D_Index(int a, int b, int c, int DATA_A, int DATA_B)

{

	return a + b * DATA_A + c * DATA_A * DATA_B;

}

inline __device__ int Calculate_4D_Index(int a, int b, int c, int d, int DATA_A, int DATA_B, int DATA_C)

{

	return a + b * DATA_A + c * DATA_A * DATA_B + d * DATA_A * DATA_B * DATA_C;

}

The functions Convolve_11x11_Enhancement_4middlefilters and Convolve_11x11_Enhancement_4lastfilters contain almost exactly the same code as Convolve_11x11_Enhancement_4firstfilters, so you can replace these calls with Convolve_11x11_Enhancement_4firstfilters.

The noinline attribute seems to work well…

ptxas info : Compiling entry function ‘_Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif’ for ‘sm_20’
ptxas info : Function properties for _Z34Convolution_2D_Enhancement_Valid_ZPfS_S_S_S_S_S_S_S_S_S_S_iiiiiiiiiiiiiiif
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z39Convolve_11x11_Enhancement_4lastfiltersPA64_fii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z40Convolve_11x11_Enhancement_4firstfiltersPA64_fii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z41Convolve_11x11_Enhancement_4middlefiltersPA64_fii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 16384+0 bytes smem, 192 bytes cmem[0], 11056 bytes cmem[2]

By changing the 11 float pointers into one float11 pointer (but keeping the convolution inlined) I’ve managed to get the register usage down to 24 registers. I want to get down to 20 registers, such that I can run 3 thread blocks with 512 threads each, for full occupancy.
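One way to read “one float11 pointer” is an array-of-structs layout, where each element bundles the 11 values that were previously spread over 11 separate float arrays; a single pointer argument then replaces 11 formal parameters (names here are hypothetical):

```cuda
// Hypothetical sketch of collapsing 11 pointer arguments into one.
// Fewer formal parameters means less pointer state for the compiler
// to track inside the kernel.
struct float11
{
    float v[11];
};

__global__ void Convolution_2D_Enhancement(float11 *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // data[i].v[0] ... data[i].v[10] replace the former
        // data_1[i] ... data_11[i] accesses
        data[i].v[0] += 1.0f;
    }
}
```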

Glad to hear things have started moving in the desired direction. Just to be clear, the goal does not necessarily have to be the total avoidance of register spills. I have certainly seen cases where the highest-performance configuration incurred some minor spilling. The caches on Fermi can help with minor spilling, but they get overwhelmed if there is significant spill traffic. So the focus should be on measured application performance.

There is, generally speaking, no strong correlation between occupancy and performance. In fact, I have seen multiple cases where higher performance was achieved with configurations of lower occupancy. A certain minimum number of threads is of course required to cover various latencies, and for kernels that use __syncthreads() one would want at least two thread blocks executing concurrently on each multiprocessor. Given that you already have 1024 threads running, any tweaking to push register usage down to 20 registers is likely to degrade per-thread performance slightly, nullifying any small gains from increasing occupancy to 1.0.

Have you looked into thread blocks of finer granularity? A block size of 256 threads (6 x 256 = 1536) or 192 threads (8 x 192 = 1536) is what I typically try to use when I want to achieve high occupancy. If this does not achieve full occupancy due to resource constraints, it could still result in the total number of threads exceeding 1024.

I’ve managed to get down to 22 registers now. If I use maxrregcount=20, there shouldn’t be much register spilling, right?

The tricky part is maximizing performance with the shared memory: 16 KB per block fits very well if I can manage to run three thread blocks at the same time. I load 64 x 64 pixels into shared memory and then calculate 48 x 48 valid filter responses; with three concurrent blocks this gives 6912 valid filter responses per multiprocessor. With 6 blocks I only get 8 KB of shared memory per thread block, which reduces the number of valid filter responses per block from 48 x 48 to 48 x 16, i.e. 4608 valid filter responses per multiprocessor in total.
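For reference, the shared-memory arithmetic works out exactly as described: a 64 x 64 float tile is 64 * 64 * sizeof(float) = 16384 bytes = 16 KB, matching the “16384+0 bytes smem” in the ptxas output, so three resident blocks consume the full 48 KB of Fermi shared memory (assuming the 48 KB shared / 16 KB L1 configuration is selected).

```cuda
// Per-block shared-memory footprint for the tiling described above.
__global__ void tile_kernel(void)
{
    __shared__ float s_image[64][64];   // 64 * 64 * 4 = 16384 bytes = 16 KB
    // With 48 KB of shared memory per SM on Fermi, at most three such
    // blocks can be resident on a multiprocessor at the same time.
    s_image[threadIdx.y][threadIdx.x] = 0.0f;
}
```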