Zero padding - CUDA vs MATLAB: Converting MATLAB fft2() to CUFFT using zero padding

Hi, I am trying to convert MATLAB code to CUDA. In MATLAB, the function Y = fft2(X,m,n) truncates X, or pads X with zeros, to create an m-by-n array before doing the transform.
I would like to perform an fft2 on a 2D filter with the CUFFT library. I did not find any CUDA API function that does zero padding, so I implemented my own. This function adds zeros to the input matrix as follows (from a 3X3 matrix to a 6X6 matrix):

3 X 3
1 1 1
1 1 1
1 1 1

to

6 X 6
1 1 1 0 0 0
1 1 1 0 0 0
1 1 1 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
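
For checking the kernel output, a host-side reference of the same padding can help. The following is a minimal sketch, assuming row-major storage and a target that is at least as large as the source (the truncation case of fft2 is not handled). The function name `zeroPadTopLeft` and the `Complex` alias are made up for this illustration and are not part of CUDA or CUFFT.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftComplex on the host.
using Complex = std::complex<float>;

// Copy a row-major oldRows x oldCols matrix into the top-left corner of a
// zero-filled row-major newRows x newCols matrix.
std::vector<Complex> zeroPadTopLeft(const std::vector<Complex>& src,
                                    int oldRows, int oldCols,
                                    int newRows, int newCols)
{
    assert(static_cast<int>(src.size()) == oldRows * oldCols);
    assert(newRows >= oldRows && newCols >= oldCols);

    // Every element starts as 0 + 0i, so only the old region is copied.
    std::vector<Complex> dst(static_cast<std::size_t>(newRows) * newCols,
                             Complex(0.0f, 0.0f));
    for (int r = 0; r < oldRows; ++r)
        for (int c = 0; c < oldCols; ++c)
            dst[r * newCols + c] = src[r * oldCols + c];  // strides differ
    return dst;
}
```

Note that the source is read with a stride of oldCols and the destination is written with a stride of newCols, and that the valid indices stop at oldRows-1 and oldCols-1.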

When I zero pad the 2D filter and compute the fft2 (CUFFT) for a square matrix (for example a 3X3 matrix), the result matches the result of the MATLAB code. But when the 2D filter is not a square matrix (for example a 3X6 matrix), the result does not match the result of the MATLAB code.

Can somebody help me, please? Can somebody verify whether MATLAB does its zero padding the way I am doing it?
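
One way to narrow down a mismatch like this is to compare both results against a slow but simple CPU reference on a small non-square input. The sketch below is a naive 2D DFT, assuming row-major storage; it uses the unnormalized forward transform with a negative exponent, which is the convention of both MATLAB's fft2 and CUFFT_FORWARD. The name `dft2` and the `Complex` alias are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<double>;

// Naive forward 2D DFT of a row-major rows x cols matrix.
// Unnormalized, kernel exp(-2*pi*i*(u*r/rows + v*c/cols)).
// Cost is O((rows*cols)^2), so this is only meant for small test inputs.
std::vector<Complex> dft2(const std::vector<Complex>& in, int rows, int cols)
{
    assert(static_cast<int>(in.size()) == rows * cols);
    const double twoPi = 2.0 * std::acos(-1.0);

    std::vector<Complex> out(in.size());
    for (int u = 0; u < rows; ++u) {
        for (int v = 0; v < cols; ++v) {
            Complex sum(0.0, 0.0);
            for (int r = 0; r < rows; ++r) {
                for (int c = 0; c < cols; ++c) {
                    const double angle =
                        -twoPi * (double(u) * r / rows + double(v) * c / cols);
                    sum += in[r * cols + c] *
                           Complex(std::cos(angle), std::sin(angle));
                }
            }
            out[u * cols + v] = sum;
        }
    }
    return out;
}
```

If the CPU reference agrees with MATLAB but not with the GPU output for a non-square size, the padding kernel or the row/column order passed to the plan is the more likely culprit than the transform itself.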

//-------------------MATLAB code--------------------------------
for (o = 1 : numOrient)
fftf{o} = fft2(allFilter{1, o}, sx+h+h, sy+h+h);
end
//-------------------MATLAB code--------------------------------
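
In fft2(X, m, n), m is the number of rows and n is the number of columns, so the call above produces sx+h+h rows and sy+h+h columns. MATLAB stores matrices in column-major order, while the kernel indexing y * newCols + x assumes row-major order. For a square matrix of ones the two layouts look identical, which may be why only the non-square case differs. In addition, cufftPlan2d(&plan, nx, ny, type) treats nx as the slowest-changing dimension, so for row-major data nx would be the row count; the plan creation is not shown here, so whether this applies is an assumption. The sketch below shows the index conversion between the two layouts; `columnMajorToRowMajor` is a hypothetical helper name.

```cpp
#include <cassert>
#include <vector>

// Convert a rows x cols matrix from column-major (MATLAB) storage
// to row-major (C/CUDA) storage.
std::vector<float> columnMajorToRowMajor(const std::vector<float>& colMajor,
                                         int rows, int cols)
{
    assert(static_cast<int>(colMajor.size()) == rows * cols);

    std::vector<float> rowMajor(colMajor.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            // Element (r, c): column-major index c*rows + r,
            //                 row-major index    r*cols + c.
            rowMajor[r * cols + c] = colMajor[c * rows + r];
    return rowMajor;
}
```

For example, the 2X3 matrix [1 2 3; 4 5 6] is stored by MATLAB as 1 4 2 5 3 6, and the helper returns 1 2 3 4 5 6.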

//--------------CUDA Code---------------------(Zero padding Kernel)
__global__ void
zeroPadding(cufftComplex* Filter, cufftComplex* InputFilterFFT, int newCols, int newRows, int oldCols, int oldRows)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

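// Valid indices run from 0 to oldCols-1 and 0 to oldRows-1, so compare with '<'.
if ((x < oldCols) && (y < oldRows))
	InputFilterFFT[(y * newCols) + x] = Filter[(y * oldCols) + x];
else
{
	// Threads that fall outside the padded matrix must not write at all.
	if ((x < newCols) && (y < newRows))
	{
		cufftComplex temp;
		temp.x = 0.0f; temp.y = 0.0f;
		InputFilterFFT[(y * newCols) + x] = temp;
	}
}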

}
//--------------Code---------------------(Zero padding Kernel)

int ApplyGaborFilter(cufftComplex **allFilter)
{
	for (int i = 0; i < 15; i++)
	{
		// Upload the i-th 17x17 filter to the device.
		CUDA_SAFE_CALL(cudaMemcpy(d_Filter, allFilter[i], (17 * 17 * sizeof(cufftComplex)), cudaMemcpyHostToDevice)); checkforerror();
		dim3 dimBlock(16, 16);
		dim3 dimGrid(cuiDivUp(cols, dimBlock.x), cuiDivUp(rows, dimBlock.y), 1);
		// Call the zero-padding kernel
		zeroPadding<<< dimGrid, dimBlock, 0 >>>( d_Filter, d_InputFilterFFT, cols, rows, 17, 17);
		CUDA_SAFE_CALL( cudaThreadSynchronize() );

		// Copy the padded filter back to the host (overwritten by the FFT result below).
		CUDA_SAFE_CALL(cudaMemcpy(FilterFFT[i], d_InputFilterFFT, (cols * rows * sizeof(cufftComplex)), cudaMemcpyDeviceToHost)); checkforerror();

		// Forward FFT of the padded filter, then copy the spectrum back to the host.
		cufftExecC2C(plan, d_InputFilterFFT, d_OutputFilterFFT, CUFFT_FORWARD);
		CUDA_SAFE_CALL(cudaMemcpy(FilterFFT[i], d_OutputFilterFFT, (cols * rows * sizeof(cufftComplex)), cudaMemcpyDeviceToHost)); checkforerror();
	}