2D Convolution problem following example from SDK source code included

I have the following bit of code that I am using trying to replicate the SDK example code, and all of the methods called in here are out of the convolution2DFFT source code:

int dcW;

   int halfl;

   const int kSize = 3;

   const int kernelY = 1;

   const int kernelX = 1;

	char tmp[128];

	int id = 0;

FILE *file;

//hp_kern is kSize x kSize

    unsigned char * charData;

float

        *h_ResultGPU,

	*float_h_data;

float

        *d_Data,

        *d_PaddedData,

        *d_Kernel,

        *d_PaddedKernel;

cufftHandle

        fftPlanFwd,

        fftPlanInv;

fComplex

        *d_DataSpectrum,

        *d_KernelSpectrum;

const int    fftH = snapTransformSize(chipH + kSize - 1);//snaps to power of 2 or 512 multiple

   const int    fftW = snapTransformSize(chipW + kSize - 1);

	setHeight(chipH);

	setWidth(chipW);

printf("fftH: %i, fftW: %i \n", fftH, fftW);

	float_h_data = (float *)malloc(chipW*chipH*sizeof(float));

/* I used this code to verify that float_h_data is in fact correct! Displaying this csv as an image in matlab matches

 * the output by using imread() on the chip size in matlab

*/

	sprintf(tmp, "orig_float_data.csv");

	file = fopen(tmp, "wb");

	for(size_t i = 0 ; i < chipH ; i++)

	{

		for(size_t j = 0 ; j < chipW ; j++)

		{

		float_h_data[i*chipW + j] = (float)h_data[i*chipW+j];

		sprintf(tmp, "%f,", float_h_data[i*chipW + j]); 

		fwrite(tmp, strlen(tmp), 1, file);

		}

		sprintf(tmp,"\n");

		fwrite(tmp, strlen(tmp), 1, file);

		

	}

	fclose(file);

printf("...allocating memory\n");

h_ResultGPU = (float *)malloc(fftH * fftW * sizeof(float));

cutilSafeCall( cudaMalloc((void **)&d_Data,   chipH   * chipW   * sizeof(float)) );

        cutilSafeCall( cudaMalloc((void **)&d_Kernel, kSize * kSize * sizeof(float)) );

cutilSafeCall( cudaMalloc((void **)&d_PaddedData,   fftH * fftW * sizeof(float)) );

cutilSafeCall( cudaMalloc((void **)&d_PaddedKernel, fftH * fftW * sizeof(float)) );

cutilSafeCall( cudaMalloc((void **)&d_DataSpectrum,   fftH * (fftW / 2 + 1) * sizeof(fComplex)) );

        cutilSafeCall( cudaMalloc((void **)&d_KernelSpectrum, fftH * (fftW / 2 + 1) * sizeof(fComplex)) );

//PLANS

    printf("...creating R2C & C2R FFT plans for %i x %i\n", fftH, fftW);

        cufftSafeCall( cufftPlan2d(&fftPlanFwd, fftH, fftW, CUFFT_R2C) );

        cufftSafeCall( cufftPlan2d(&fftPlanInv, fftH, fftW, CUFFT_C2R) );

//UPLOAD DATA TO DEVICE

printf("...uploading to GPU and padding convolution kernel and input data\n");

        cutilSafeCall( cudaMemcpy(d_Kernel, hp_kernel, kSize * kSize * sizeof(float), cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMemcpy(d_Data,   float_h_data,   chipH   * chipW *   sizeof(float), cudaMemcpyHostToDevice) );

        cutilSafeCall( cudaMemset(d_PaddedKernel, 0, fftH * fftW * sizeof(float)) );

        cutilSafeCall( cudaMemset(d_PaddedData,   0, fftH * fftW * sizeof(float)) );

//adjust the padKernel and border conditions

padKernel(d_PaddedKernel, d_Kernel, fftH, fftW, kSize, kSize, kernelY, kernelX);

	//pad Kernel is working as described in the cufft2D paper. 

padDataClampToBorder( d_PaddedData, d_Data, fftH, fftW, chipH, chipW, kSize, kSize, kernelY, kernelX);

	charData = (unsigned char *)malloc(chipH *chipW * sizeof(unsigned char));

printf("...transforming convolution kernel\n");

        cufftSafeCall( cufftExecR2C(fftPlanFwd, (cufftReal *)d_PaddedKernel, (cufftComplex *)d_KernelSpectrum) );

printf("...running GPU FFT convolution: ");

        cufftSafeCall( cufftExecR2C(fftPlanFwd, (cufftReal *)d_PaddedData, (cufftComplex *)d_DataSpectrum) );

	

	modulateAndNormalize(d_DataSpectrum, d_KernelSpectrum, fftH, fftW, 1);

cufftSafeCall( cufftExecC2R(fftPlanInv, (cufftComplex *)d_DataSpectrum, (cufftReal *)d_PaddedData) ); //puts DataSpectrum back to float

printf("...reading back GPU convolution results\n");

        cutilSafeCall( cudaMemcpy(h_ResultGPU, d_PaddedData, fftH * fftW * sizeof(float), cudaMemcpyDeviceToHost) );

	for(size_t i = 0 ; i < chipH ; i++)

	{

		for(size_t j = 0 ; j < chipW ; j++)

		{

			charData[i*chipH + j] = (unsigned char)h_ResultGPU[i*fftW+j];

		

                }

	}

	return charData;

So I am basically passing in h_data which is an unsigned char array, and converting it to float, and then performing the FFT’s on the padded kernel and data matrices. As I mentioned in the code comments, I verified that in fact I am passing the right data there, because I can read that .csv file into matlab and display the image before processing properly.

However, when I try ti display charData after the return, it is coming back looking like noise/snow whenever I run this on my input jpg data (which has been reduced to grayscale).

Also, my kernel is a 3x3 do-nothing kernel, so the output should be exactly like the input.

Can anyone see anything wrong in my code?

bump…anyone?

Bump again - No one on this forum has ever tried to implement your own image filter using the 2D convolution example and gotten strange results?

Well i do, i don’t know why …
When i try without “padKernel” and “padDataClampToBorder” the result is visually better, but not exactly what we want of course …

To perform the convolution of an image by another one, we have to use this 2DConvolution and not the convolutionSeparable right ?

In my case, I ended up having to create some code to remap the results from the GPU to an unsigned character array. Just trying to cast the results back to (unsigned char) did not display properly.

The remap code I think is proprietary, so I cant show it, but I basically had to calculate the mean, and mapped it with some log based functions.

I think you had to remap the results to an unsigned character array to save the convoluted image in a special format ?
I just want to save it as raw data …

So, does anyone have any idea about the thing to do to make it work ? (did someone have already tried it ?)

i don’t understand what is wrong …

bool test1(void){

    float

        *h_Data,

        *h_Kernel,

        *h_ResultGPU;

float

        *d_Data,

        *d_PaddedData,

        *d_Kernel,

        *d_PaddedKernel;

fComplex

        *d_DataSpectrum,

        *d_KernelSpectrum;

cufftHandle

        fftPlanFwd,

        fftPlanInv;

unsigned int hTimer;

    cutilCheckError( cutCreateTimer(&hTimer) );

printf("Testing built-in R2C / C2R FFT-based convolution\n");

        const int kernelH = 6;

        const int kernelW = 6;

        const int kernelY = 3;

        const int kernelX = 3;

        const int   dataH = 412;

        const int   dataW = 512;

        const int    fftH = snapTransformSize(dataH + kernelH - 1);

        const int    fftW = snapTransformSize(dataW + kernelW - 1);

printf("...allocating memory\n");

        h_Data      = (float *)malloc(dataH   * dataW * sizeof(float));

        h_Kernel    = (float *)malloc(kernelH * kernelW * sizeof(float));

        h_ResultGPU = (float *)malloc(fftH    * fftW * sizeof(float));

cutilSafeCall( cudaMalloc((void **)&d_Data,   dataH   * dataW   * sizeof(float)) );

        cutilSafeCall( cudaMalloc((void **)&d_Kernel, kernelH * kernelW * sizeof(float)) );

cutilSafeCall( cudaMalloc((void **)&d_PaddedData,   fftH * fftW * sizeof(float)) );

        cutilSafeCall( cudaMalloc((void **)&d_PaddedKernel, fftH * fftW * sizeof(float)) );

cutilSafeCall( cudaMalloc((void **)&d_DataSpectrum,   fftH * (fftW / 2 + 1) * sizeof(fComplex)) );

        cutilSafeCall( cudaMalloc((void **)&d_KernelSpectrum, fftH * (fftW / 2 + 1) * sizeof(fComplex)) );

	//==========================================================

	//==========================================================

    printf("... reading input data\n");

	

	FILE* Fichier_image;

	Fichier_image = check_filename("bridge.raw","rb");

	int ni = 0;

	for (ni = 0; ni< dataH*dataW; ni++) 

		{

		fread( &h_Data[ni], 1, 1, Fichier_image);

		}

for(int i = 0; i < kernelH * kernelW; i++)

	   {h_Kernel[i] = 0;}

		h_Kernel[kernelX*kernelY] = 1;

			

	//==========================================================

	//==========================================================

printf("...creating R2C & C2R FFT plans for %i x %i\n", fftH, fftW);

        cufftSafeCall( cufftPlan2d(&fftPlanFwd, fftH, fftW, CUFFT_R2C) );

        cufftSafeCall( cufftPlan2d(&fftPlanInv, fftH, fftW, CUFFT_C2R) );

printf("...uploading to GPU and padding convolution kernel and input data\n");

        cutilSafeCall( cudaMemcpy(d_Kernel, h_Kernel, kernelH * kernelW * sizeof(float), cudaMemcpyHostToDevice) );

        cutilSafeCall( cudaMemcpy(d_Data,   h_Data,   dataH   * dataW *   sizeof(float), cudaMemcpyHostToDevice) );

        cutilSafeCall( cudaMemset(d_PaddedKernel, 0, fftH * fftW * sizeof(float)) );

        cutilSafeCall( cudaMemset(d_PaddedData,   0, fftH * fftW * sizeof(float)) );

padKernel(

            d_PaddedKernel,

            d_Kernel,

            fftH,

            fftW,

            kernelH,

            kernelW,

            kernelY,

            kernelX

        );

padDataClampToBorder(

            d_PaddedData,

            d_Data,

            fftH,

            fftW,

            dataH,

            dataW,

            kernelH,

            kernelW,

            kernelY,

            kernelX

        );

//Not including kernel transformation into time measurement,

    //since convolution kernel is not changed very frequently

    printf("...transforming convolution kernel\n");

        cufftSafeCall( cufftExecR2C(fftPlanFwd, (cufftReal *)d_PaddedKernel, (cufftComplex *)d_KernelSpectrum) );

printf("...running GPU FFT convolution: ");

        cutilSafeCall( cutilDeviceSynchronize() );

        cutilCheckError( cutResetTimer(hTimer) );

        cutilCheckError( cutStartTimer(hTimer) );

        cufftSafeCall( cufftExecR2C(fftPlanFwd, (cufftReal *)d_PaddedData, (cufftComplex *)d_DataSpectrum) );

        modulateAndNormalize(d_DataSpectrum, d_KernelSpectrum, fftH, fftW, 1);

        cufftSafeCall( cufftExecC2R(fftPlanInv, (cufftComplex *)d_DataSpectrum, (cufftReal *)d_PaddedData) );

cutilSafeCall( cutilDeviceSynchronize() );

        cutilCheckError( cutStopTimer(hTimer) );

        double gpuTime = cutGetTimerValue(hTimer);

    printf("%f MPix/s (%f ms)\n", (double)dataH * (double)dataW * 1e-6 / (gpuTime * 0.001), gpuTime);

printf("...reading back GPU convolution results\n");

        cutilSafeCall( cudaMemcpy(h_ResultGPU, d_PaddedData, fftH * fftW * sizeof(float), cudaMemcpyDeviceToHost) );

	//==========================================================

	//==========================================================

		

			

		FILE* Fichier_image2;

		Fichier_image2 = check_filename("bridge5.raw","wb");

		ni = 0;

		for (ni = 0; ni< fftH * fftW; ni++) 

		{

		fwrite( &h_ResultGPU[ni], 1, 1, Fichier_image2);

		}

	

	//==========================================================

	//==========================================================

printf("...shutting down\n");

        cutilCheckError( cutDeleteTimer(hTimer) );

cufftSafeCall( cufftDestroy(fftPlanInv) );

        cufftSafeCall( cufftDestroy(fftPlanFwd) );

cutilSafeCall( cudaFree(d_DataSpectrum)   );

        cutilSafeCall( cudaFree(d_KernelSpectrum) );

        cutilSafeCall( cudaFree(d_PaddedData)   );

        cutilSafeCall( cudaFree(d_PaddedKernel) );

        cutilSafeCall( cudaFree(d_Data)   );

        cutilSafeCall( cudaFree(d_Kernel) );

free(h_ResultGPU);

        free(h_Data);

        free(h_Kernel);

		return 1;

}

I only added code between //====. I’m just reading and copying raw data images.

I tried with a kernel 1x1 with only 1 pixel to 1, and got wrong results.

I’m sure my image is 512x412, and i’ve already tried 400x300 with another image.

Nothing works. Any idea ?

ps : to check the result I open the raw data Image with ImageJ. I noticed that the “dead pixel” bug appears in the brighter zone of the image …

Find attached the initial image to the left 400x300, and to right the result image 512x512 with the predicted blur effect, but with many faults …

EDIT : i tried to do FFT and inverse FFT only and didn’t success … maybe i’m doing something wrong with something simpler than fft2D … i opened a more specific topic : http://forums.nvidia.com/index.php?showtopic=201610

I tried to just take the pixels I want, now it’s better but they are still faults in the image (black pixel). Maybe I’m losing information doing that anyway …

for (int n = 0; n <dataH; n++)

			{

				for (ni = n*fftW; ni< n*fftW+dataW; ni++) 

				{

						fwrite( &h_ResultGPU[ni], 1, 1, Fichier_image2);

				}

			}

[EDIT] The faults were caused by the format changement (from unsigned char to float) it’s okay now, but it doesn’t work for correlation

Last bump … Why do we have to reorganize the data to have a good result, is it normal ?

I saw that in the initial image, the comparison of the results (GPU/CPU) is just made from 0 to dataW (so, not to fftW) as I did in my reorganization.

for(int y = 0; y < dataH; y++)

            for(int x = 0; x < dataW; x++){

                double  rCPU = (double)h_ResultCPU[y * dataW + x];

                double  rGPU = (double)h_ResultGPU[y * fftW  + x];

                double delta = (rCPU - rGPU) * (rCPU - rGPU);

Even if I can select the data I want and apply a mean filter, I don’t success to do the correlation between one image and a part of this image (the results are weird). Anyone has an idea ?

Okay, this one is the last dump, i’m desesperate …

What is the good way to fill the Kernel ?
I don’t understand why i can’t have a good result by using 2 images as data, and by doing the convolution with this code …

Do we have to deal ourselves with the kernel padclampToBorder ? (when we fill the kernel with float data ?)