Optimization for small kernels: how to optimize small kernels with fewer instructions

Hi,

I have a CUDA kernel with the following configuration.

Block Dim: 16 * 32

Grid Dim: 128 * 907

I need to process a float2 array (pComplexNumber_i) of dimension 2048 * 32 * 907 by multiplying each row (2048 elements) of the float2 array with a one-dimensional array (pfFilterArray_i). The multiplication is one-to-one, i.e. element 1 of pComplexNumber_i with element 1 of pfFilterArray_i, and so on…

Following is the kernel code:

__global__ void ApplyFilter( float*   pfFilterArray_i,
                             Complex* pComplexNumber_i,  // Complex is a float2-style struct with .x/.y members
                             int      nWidth_i )
{
    // IMAD(a, b, c) is an integer multiply-add helper, i.e. a * b + c.
    // nIndexX is the column, i.e. the index into the filter array.
    unsigned nIndexX = IMAD(blockDim.x, blockIdx.x, threadIdx.x);
    // nIndexY is the row within the 32 * 907 rows.
    unsigned nIndexY = IMAD(blockDim.y, blockIdx.y, threadIdx.y);
    // Linear index into the row-major complex array.
    unsigned nIndex  = IMAD(nIndexY, nWidth_i, nIndexX);

    const float fFilterValue = pfFilterArray_i[nIndexX];
    pComplexNumber_i[nIndex].x *= fFilterValue;
    pComplexNumber_i[nIndex].y *= fFilterValue;
}

nIndexX ranges from 0 to 2047 (16 * 128 = 2048 values)

nIndexY ranges from 0 to 29023 (32 * 907 values)

nWidth_i is 2048

Both pComplexNumber_i and pfFilterArray_i are in global memory.

The GPU used is a Tesla C1060.

Currently I am getting a memory throughput of 80.4662 GB/s and an instruction throughput of 0.284583 when checked with the CUDA Visual Profiler for the Tesla C1060. According to these values the memory throughput is almost at its maximum while the instruction throughput is very low, and the execution time is 6542.64 microseconds.

Please let me know if there is any possible method to optimize this kernel further to increase performance.

You could try to reduce the number of threads by combining them:
let the kernel operate on more elements than just one.
Make a loop over, say, 32 elements and #pragma unroll it ;>
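A minimal sketch of what that could look like, assuming each thread handles 32 consecutive rows (the constant ELEMS_PER_THREAD, the plain multiply-add indexing, and the launch configuration below are illustrative choices, not from the original post):

__global__ void ApplyFilterUnrolled( float*  pfFilterArray_i,
                                     float2* pComplexNumber_i,
                                     int     nWidth_i )
{
    const int ELEMS_PER_THREAD = 32;

    unsigned nIndexX = blockDim.x * blockIdx.x + threadIdx.x;
    // First row handled by this thread; each thread covers 32 rows.
    unsigned nIndexY = (blockDim.y * blockIdx.y + threadIdx.y) * ELEMS_PER_THREAD;

    const float fFilterValue = pfFilterArray_i[nIndexX];  // one load, reused 32 times

    #pragma unroll
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
    {
        unsigned nIndex = (nIndexY + i) * nWidth_i + nIndexX;
        pComplexNumber_i[nIndex].x *= fFilterValue;
        pComplexNumber_i[nIndex].y *= fFilterValue;
    }
}

Launched with, e.g., dim3 block(16, 1) and dim3 grid(128, 907), each thread then covers the 32 rows that its block's y dimension previously spanned.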

But since this kernel contains very little arithmetic and is mostly dependent on memory speed, will this thread reduction create a significant improvement?

You can loop over nIndexY inside your kernel so that you can reuse fFilterValue without loading it again (see the sketch below). This would give a 20% improvement.

You may combine this kernel with the previous or next one operating on the data.
If this is the only operation on the data, you may just do it on the CPU, because it's not worth copying the data over to the GPU.
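A minimal sketch of the nIndexY loop, assuming a 1-D launch over the 2048 columns and an extra (assumed) parameter nHeight_i carrying the row count, 32 * 907 here:

__global__ void ApplyFilterLoopY( float*  pfFilterArray_i,
                                  float2* pComplexNumber_i,
                                  int     nWidth_i,
                                  int     nHeight_i )
{
    unsigned nIndexX = blockDim.x * blockIdx.x + threadIdx.x;

    const float fFilterValue = pfFilterArray_i[nIndexX];  // loaded exactly once per thread

    for (int nIndexY = 0; nIndexY < nHeight_i; ++nIndexY)
    {
        unsigned nIndex = nIndexY * nWidth_i + nIndexX;
        pComplexNumber_i[nIndex].x *= fFilterValue;
        pComplexNumber_i[nIndex].y *= fFilterValue;
    }
}

Note that a launch of only 2048 threads total leaves most of the C1060's 30 multiprocessors idle, so a middle ground (each thread looping over a slice of the rows, as in the earlier sketch) may balance filter reuse against occupancy.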

Hi,

Thanks for your reply. I have looped over nIndexY inside my kernel and reused fFilterValue.

Now the kernel launches with 2048 threads in the x dimension (nIndexX) and a y dimension of 1 (nIndexY).

But there was no significant improvement; the execution time actually increased to 6862.08 microseconds.

I have also changed the global memory array pfFilterArray_i to constant memory, but it saved only about 300 microseconds.
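For reference, a sketch of the constant-memory variant (the symbol name c_fFilter and the host-side copy are assumptions; 2048 floats = 8 KB fits comfortably in the C1060's 64 KB of constant memory):

__constant__ float c_fFilter[2048];

// Host side, once before the kernel launches (pfFilterHost is a hypothetical host array):
// cudaMemcpyToSymbol(c_fFilter, pfFilterHost, 2048 * sizeof(float));

__global__ void ApplyFilterConst( float2* pComplexNumber_i, int nWidth_i )
{
    unsigned nIndexX = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned nIndexY = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned nIndex  = nIndexY * nWidth_i + nIndexX;

    const float fFilterValue = c_fFilter[nIndexX];  // served from the constant cache
    pComplexNumber_i[nIndex].x *= fFilterValue;
    pComplexNumber_i[nIndex].y *= fFilterValue;
}

On compute capability 1.x hardware the constant cache only broadcasts when all threads of a half-warp read the same address; here each thread reads a different element, which serializes the accesses and may explain the modest saving.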

Is there any other method to optimize this kernel?

Maybe this kernel is sandwiched between other kernel operations which use the same data?

Perfect - just fuse the kernel with the previous or next one which accesses the data, so that the kernels share the memory access costs.
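To illustrate the idea with a hypothetical preceding elementwise step (PreviousOp and its fScale parameter are invented for this sketch; the point is the single read-modify-write pass shared by both operations):

__global__ void PreviousOpAndFilter( float*  pfFilterArray_i,
                                     float2* pComplexNumber_i,
                                     int     nWidth_i,
                                     float   fScale )   // parameter of the hypothetical previous op
{
    unsigned nIndexX = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned nIndexY = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned nIndex  = nIndexY * nWidth_i + nIndexX;

    float2 v = pComplexNumber_i[nIndex];   // single global read shared by both steps
    v.x *= fScale;                         // work of the hypothetical previous kernel
    v.y *= fScale;
    const float fFilterValue = pfFilterArray_i[nIndexX];
    v.x *= fFilterValue;                   // the filter step
    v.y *= fFilterValue;
    pComplexNumber_i[nIndex] = v;          // single global write
}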

Thanks for the input, but this kernel cannot be fused with other kernels. The input pComplexNumber_i comes from an FFT.

I take the FFT of pComplexNumber_i and pass it to ApplyFilter:

cufftExecC2C( pComplexNumber_i, CUFFT_FORWARD )

ApplyFilter<<< …, … >>>( …, pComplexNumber_i, … )

cufftExecC2C( pComplexNumber_i, CUFFT_INVERSE )

The other kernels do not use the FFT complex data, so ApplyFilter cannot be fused with them.
Please let me know if there are any other ideas to optimize this kernel.
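For context, a sketch of the full pipeline with complete cuFFT signatures (the plan parameters are assumptions based on the 2048-point rows and the 32 * 907 batch count stated above):

cufftHandle plan;
cufftPlan1d(&plan, 2048, CUFFT_C2C, 32 * 907);   // 2048-point C2C transforms, batched over all rows

dim3 block(16, 32);
dim3 grid(128, 907);

cufftExecC2C(plan, (cufftComplex*)pComplexNumber_i,
                   (cufftComplex*)pComplexNumber_i, CUFFT_FORWARD);
ApplyFilter<<<grid, block>>>(pfFilterArray_i, pComplexNumber_i, 2048);
cufftExecC2C(plan, (cufftComplex*)pComplexNumber_i,
                   (cufftComplex*)pComplexNumber_i, CUFFT_INVERSE);

cufftDestroy(plan);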

So how does the execution time of ApplyFilter() relate to that of cufftExecC2C then?
