I have a CUDA kernel with the following configuration:
Block Dim: 16 * 32
Grid Dim: 128 * 907
I need to process a float2 array (pComplexNumber_i) of dimension 2048 * 32 * 907 by multiplying each row (2048 elements) of the float2 array with a one-dimensional array (pfFilterArray_i). The multiplication is element-wise, i.e. element 1 of pComplexNumber_i with element 1 of pfFilterArray_i, and so on.
The kernel code is as follows.
Kernel Code:
__global__ void ApplyFilter( float* pfFilterArray_i,
                             Complex* pComplexNumber_i,
                             int nWidth_i )
{
    // IMAD(a, b, c) computes a * b + c.
    // Column index within a row; selects the filter coefficient.
    unsigned nIndexX = IMAD( blockDim.x, blockIdx.x, threadIdx.x );
    // Row index across the 32 * 907 rows.
    unsigned nIndexY = IMAD( blockDim.y, blockIdx.y, threadIdx.y );
    // Linear index into the float2 array.
    unsigned nIndex  = IMAD( nIndexY, nWidth_i, nIndexX );

    const float fFilterValue = pfFilterArray_i[nIndexX];
    pComplexNumber_i[nIndex].x *= fFilterValue;
    pComplexNumber_i[nIndex].y *= fFilterValue;
}
nIndexX has a range of 0 to 2047.
nIndexY has a range of 0 to (32 * 907) - 1.
nWidth_i is 2048.
Both pComplexNumber_i and pfFilterArray_i are in global memory.
GPU used is Tesla C1060.
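For clarity, the launch described above would look roughly like this on the host side (a sketch with hypothetical pointer names, assuming both arrays are already in device memory):

```cuda
// Hypothetical host-side launch matching the stated configuration:
// block 16 x 32 threads, grid 128 x 907 blocks, giving
// 16 * 128 = 2048 threads across each row and 32 * 907 rows in y.
dim3 dBlock( 16, 32 );
dim3 dGrid( 128, 907 );
ApplyFilter<<<dGrid, dBlock>>>( pfFilterArray_i, pComplexNumber_i, 2048 );
```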
Currently I am getting a memory throughput of 80.4662 and an instruction throughput of 0.284583 when checked with the CUDA Visual Profiler for the Tesla C1060. According to these values the memory throughput is almost at maximum while the instruction throughput is very low, and the execution time is 6542.64 microseconds.
Please let me know any possible method to optimize this kernel further to increase performance.
You could try to reduce the number of threads by combining them: let each thread operate on more than one element. Make a loop over, say, 32 elements and #pragma unroll it.
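The suggestion above could be sketched as follows (a hypothetical coarsened variant, assuming the same IMAD macro and Complex typedef as the original kernel; each thread now processes ELEMENTS_PER_THREAD consecutive rows, so gridDim.y would be divided by that factor at launch):

```cuda
#define ELEMENTS_PER_THREAD 32

// Sketch: each thread handles ELEMENTS_PER_THREAD rows, reusing the filter
// coefficient that is loaded from global memory only once.
// Launch with gridDim.y reduced by a factor of ELEMENTS_PER_THREAD.
__global__ void ApplyFilterCoarsened( float* pfFilterArray_i,
                                      Complex* pComplexNumber_i,
                                      int nWidth_i )
{
    unsigned nIndexX = IMAD( blockDim.x, blockIdx.x, threadIdx.x );
    unsigned nIndexY = IMAD( blockDim.y, blockIdx.y, threadIdx.y )
                       * ELEMENTS_PER_THREAD;

    // One global load of the filter value, reused for all rows below.
    const float fFilterValue = pfFilterArray_i[nIndexX];

    #pragma unroll
    for ( int i = 0; i < ELEMENTS_PER_THREAD; ++i )
    {
        unsigned nIndex = IMAD( nIndexY + i, nWidth_i, nIndexX );
        pComplexNumber_i[nIndex].x *= fFilterValue;
        pComplexNumber_i[nIndex].y *= fFilterValue;
    }
}
```

Since consecutive threads within a warp still access consecutive nIndexX positions, the accesses remain coalesced; the saving comes from issuing fewer redundant filter loads and index computations.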
You can loop over nIndexY inside your kernel so that you can reuse fFilterValue without loading it again. This would give roughly a 20% improvement.
You could also combine this kernel with the previous or next kernel operating on the data.
If this is the only operation on the data, you might as well do it on the CPU, because it is not worth copying the data over to the GPU just for this.
Thanks for the input, but this kernel cannot be fused with other kernels. The input pComplexNumber_i comes from an FFT.
I take the FFT of pComplexNumber_i and feed it to ApplyFilter:
cufftExecC2C (pComplexNumber_i, CUFFT_FORWARD )
ApplyFilter<<<…, …>>> (…, pComplexNumber_i, …)
cufftExecC2C (pComplexNumber_i, CUFFT_INVERSE )
The other kernels do not use the FFT complex data, so ApplyFilter cannot be fused with them.
Please let me know if you have any other ideas to optimize this kernel.
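For reference, the forward-filter-inverse sequence above might look like this when written out in full (a sketch, not the poster's actual code; it assumes a cufftHandle created elsewhere with a batched 1D plan, in-place transforms, and the launch configuration stated in the question):

```cuda
#include <cufft.h>

// Sketch of the forward FFT -> filter -> inverse FFT pipeline.
// Assumes 'plan' was created with
//   cufftPlan1d( &plan, 2048, CUFFT_C2C, 32 * 907 );
// and that pComplexNumber_i / pfFilterArray_i are device pointers.
void RunFilterPipeline( cufftHandle plan,
                        cufftComplex* pComplexNumber_i,
                        float* pfFilterArray_i )
{
    dim3 dBlock( 16, 32 );
    dim3 dGrid( 128, 907 );

    // In-place forward transform of all 32 * 907 rows of 2048 elements.
    cufftExecC2C( plan, pComplexNumber_i, pComplexNumber_i, CUFFT_FORWARD );

    // Element-wise multiply of each row by the filter.
    ApplyFilter<<<dGrid, dBlock>>>( pfFilterArray_i,
                                    (Complex*)pComplexNumber_i, 2048 );

    // In-place inverse transform (note: cuFFT leaves the inverse
    // unnormalized, so a scale by 1/2048 may be needed afterwards).
    cufftExecC2C( plan, pComplexNumber_i, pComplexNumber_i, CUFFT_INVERSE );
}
```

One design option this structure suggests: the 1/N normalization of the inverse FFT could be folded into pfFilterArray_i for free, saving a separate scaling pass.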