Difference between Device emulation and execution modes

Hi Everyone,

I have been stuck on this problem for a long time…trying out everything possible…and will be really, really thankful if someone could help!! I will describe my problem here in points:

  1. I am using a GeForce 9800 GTX+ on a 32-bit Windows XP machine, with C and CUDA 2.0. Half the project is in Matlab and half in C (the GPU kernel is in C, of course), and I am using nvmex functions/conversions. I am using Matlab R2008a and VS2005 (8.0); the underlying compiler is the VS compiler.

  2. I have a very simple .cu code that does some 8 convolutions over an image and then adds the 8 convolved images…not very compute-intensive…just 64 adds and 64+256 multiplies per pixel.

  3. The outputs from GPU emulation and execution do not match. I wouldn't say they are completely and unacceptably different…but what little accuracy I get with GPU execution is just not sufficient for my purpose. So I have been trying a lot of different things to track down the possible error.

  4. I have simplified the code to a very great extent…removed all shared-memory accesses; everything is loaded and stored in global memory and registers. I have set aside the basic aim of speed-up…all I am concentrating on is accuracy.

  5. I am not using double at all. Also, I explicitly type-cast all intermediate operations/results to float to eliminate the use of extended precision on the CPU (emulation mode). The emulation-mode result has moved away from the original reference…but it's still not close to what the GPU execution is giving.

  6. I wrote a function to check whether the CPU was handling any denormalized numbers in emulation mode, leading to higher accuracy…and, if so, to flush the denormalized number to zero and thus behave more like the GPU (see the first sketch below this list). But I am quite convinced that the CPU never encounters a denormalized number in the entire course of the application.

  7. I used __fadd_rn and __fmul_rn to decompose FMADs; that reduced the relative error only negligibly. I tried changing the order of the floating-point operations [(a+b)+(c+d), ((a+b)+c)+d, etc.]…but it did not help. I even tried Kahan summation to increase the floating-point accuracy (see the second sketch below this list)…but no improvement.

  8. I am sure about the CUDA/GPU initialization…I picked that routine up from one of the sample codes in the SDK and have already tried it with other test codes I had written before. CUDA error handling never reported any errors. I can't use the SDK's error-handling macros in this project since nvmex doesn't allow that (see the third sketch below this list)…but I am very sure that none of the CUDA calls is creating an issue…I have tested by transferring intermediate results at every step back to the CPU and printing them from there…and never found any discrepancy in whatever is exchanged between the CPU and the GPU in execution mode.

  9. I tried 1-block/1-thread and 1-block/32-thread kernel configurations to check whether there was a race condition or memory-access issue…but nothing improves the GPU results.

  10. When I write simple test codes, say 10000 floating-point additions in a loop, the results are the same in both execution and emulation. This makes me feel there is some small thing I am missing. It's the same piece of code, compiled separately for emulation and execution :( :(
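
For reference, the denormal check/flush from point 6 boils down to something like this (a minimal sketch; flush_denormal is just my name for the helper):

[code]
#include <float.h>
#include <math.h>

/* Any nonzero float with magnitude below FLT_MIN is denormalized;
   flushing such values to zero mimics the GPU's flush-to-zero behaviour. */
static float flush_denormal(float x)
{
	if (x != 0.0f && fabsf(x) < FLT_MIN)
		return 0.0f;
	return x;
}
[/code]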

I tried everything I could think of (more than what I have described above) to improve the GPU execution results and bring them close to the CPU emulation results. When that did not help, I tried attacking the discrepancy from the CPU side (flushing denormalized numbers, strict single-precision arithmetic, etc.) to see if the CPU results would degrade and move towards the execution results. Nothing changes things drastically or brings the two results even within acceptable error limits. I have checked things at intermediate steps too…the error gradually builds up…but it starts out large in the first place.
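
The Kahan summation mentioned in point 7 had roughly this shape (again a sketch; kahan_sum is my own helper name, and an aggressively optimizing compiler may cancel the compensation term unless optimization is restrained). The FMAD decomposition in the same point just means replacing a*b + c with __fadd_rn(__fmul_rn(a, b), c) so that nvcc cannot contract it into a single MAD:

[code]
/* Compensated summation: comp carries the low-order bits that a
   plain running sum would lose at each step. */
static float kahan_sum(const float *x, int n)
{
	float sum = 0.0f, comp = 0.0f;
	int i;
	for (i = 0; i < n; i++) {
		float y = x[i] - comp;
		float t = sum + y;      /* low-order bits of y are lost here... */
		comp = (t - sum) - y;   /* ...and recovered here */
		sum = t;
	}
	return sum;
}
[/code]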

I am saturated…I don't know what more to try. I am sharing my code (the .cu part) below if someone wants to see it (I keep changing the constant BIN_QTY to run different numbers of threads). Please suggest anything. I cannot get funds to buy higher-end hardware (say a GTX 200-series card, which supports double precision) until I have proved that it's the device's fault and not mine…I don't want to be stuck with the same problem even on expensive hardware.
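
On point 8: the kind of plain runtime-API error check I mean (the SDK's cutil macros are out under nvmex) would look roughly like this; CUDA_CHECK is a hypothetical macro name and reporting via mexPrintf is my choice:

[code]
#define CUDA_CHECK(call)                                           \
	do {                                                           \
		cudaError_t e = (call);                                    \
		if (e != cudaSuccess)                                      \
			mexPrintf("CUDA error: %s\n", cudaGetErrorString(e));  \
	} while (0)

/* usage: CUDA_CHECK(cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice)); */
[/code]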

[code]

/*********************************************************************
  This CUDA program deals with only the convolutions part of CSDD
**********************************************************************/

#include <stdio.h>

#include <stdlib.h>

#include "math.h"

#include "matrix.h"

#include "mex.h"

#include "cuda_runtime.h"

#include "cuda_runtime_api.h" 

#include "cuda.h"

#include "math_functions.h"

#define BIN_QTY 32

/************************************************************************/
/* Init CUDA                                                            */
/************************************************************************/

#if __DEVICE_EMULATION__

bool InitCUDA(void){return true;}

#else

bool InitCUDA(void)

{

	int count = 0;

	int i = 0;

	cudaGetDeviceCount(&count);

	if(count == 0) {

		fprintf(stderr, "There is no device.\n");

		return false;

	}

	for(i = 0; i < count; i++) {

		cudaDeviceProp prop;

		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {

			if(prop.major >= 1) {

				break;

			}

		}

	}

	if(i == count) {

		fprintf(stderr, "There is no device supporting CUDA.\n");

		return false;

	}

	cudaSetDevice(i);

	printf("CUDA initialized.\n");

	return true;

}

#endif

/*************************************************************************/
/* The Code Begins - Equivalent of dericheCSDEMD(inimage,gaucoeffs,gau2coeffs) */
/*************************************************************************/

// Host-side globals holding the device pointers and image dimensions
// (the kernel receives them as launch arguments, so no __device__ qualifier is needed)
float *inimage, *scoreim, *tmpimage, *filt;

float *rowcoeffs, *colcoeffs;

int numRows, numCols;

extern "C"

__global__ static void ProcessFrame_DeviceKernel(float* inimage, float* scoreim, float* tmpimage, float* filt, float* rowcoeffs, float* colcoeffs, int numRows, int numCols)

{

	int i,Tx;

		

	// 1D Thread ID

	Tx = threadIdx.x;

	

	__syncthreads();

//=====================================================================

	//STEP THROUGH 128 GREY(float)SCALE CHANNELS 0-1, 2-3, ..., 254-255 AND DO

	//LAPLACIAN OF GAUSSIAN CSD FILTERING, ACCUMULATING RESULT INTO SCOREIM
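
	// one thread per grey-level channel: thread Tx thresholds at dgrey = (2*Tx+1)/256,
	// so with BIN_QTY threads only the first BIN_QTY of the 128 channels are processed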

	int grey = 2*Tx+1;

	float dgrey = (float)((float)grey/(float)256.0);

	int j,r,c;

	float asum, dtmp;

	float M0,M1,M2,M3,M4;			  //memory registers

	//===================================================================

	//Rounds 1 and 2: Calculations for the first half of FILT1

	__syncthreads();

	i=0;	

	//FORWARD FILTER PASS ALONG ROWS for FILT1	  

	asum = (float)(1.0 + rowcoeffs[0] + rowcoeffs[1] + rowcoeffs[2] + rowcoeffs[3]);

	for (r=0; r < numRows; r++) {

		M1 = M2 = M3 = M4 = (((float)(inimage[i+r]<=dgrey)) / asum);

		for (c=0; c < numCols; c++) {

			dtmp = ((float)(inimage[(i+r)+(c*numRows)]<=dgrey));

			M0 =  (float)dtmp - (float)(rowcoeffs[0]*M1) - (float)(rowcoeffs[1]*M2) - (float)(rowcoeffs[2]*M3) - (float)(rowcoeffs[3]*M4); 

			tmpimage[(Tx*numRows*numCols)+(i+r)+(c*numRows)] += (float)(rowcoeffs[4]*M0) + (float)(rowcoeffs[5]*M1) + (float)(rowcoeffs[6]*M2) + (float)(rowcoeffs[7]*M3);

			M4 = M3; M3 = M2; M2 = M1; M1 = M0;

		}	

	}																																														

	//if (Tx==0) {

	//  for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//  mexPrintf("\n**************************************************\n\n");

	//}

	

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG ROWS for FILT1

	asum = (float)(1.0 + rowcoeffs[8] + rowcoeffs[9] + rowcoeffs[10] + rowcoeffs[11]);

	for (r=0; r < numRows; r++) { 

		M1 = M2 = M3 = M4 = (((float)(inimage[numRows*numCols-(i+r)-1]<=dgrey)) / asum);

		for (c=0; c < numCols; c++) {

			dtmp = ((float)(inimage[(numRows*numCols)-(i+r)-1-(c*numRows)]<=dgrey));

			M0 =  (float)dtmp - (float)(rowcoeffs[8]*M1) - (float)(rowcoeffs[9]*M2) - (float)(rowcoeffs[10]*M3) - (float)(rowcoeffs[11]*M4);

			tmpimage[((Tx+1)*numRows*numCols)- (i+r)-1-(c*numRows)] += (float)(rowcoeffs[12]*M1) + (float)(rowcoeffs[13]*M2) + (float)(rowcoeffs[14]*M3) + (float)(rowcoeffs[15]*M4);

			M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

	   }

	}	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	//===================================================================

	//Rounds 3 and 4: Calculations for the second half of FILT1

	

	i=0;

	__syncthreads();

	//FORWARD FILTER PASS ALONG COLUMNS for FILT1

	asum = (float)(1.0 + colcoeffs[0] + colcoeffs[1] + colcoeffs[2] + colcoeffs[3]);

	for (c=0; c < numCols; c++) {

		M1 = M2 = M3 = M4 = ((float)(tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)] / asum));

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)+r]);

		   M0 = (float)dtmp - (float)(colcoeffs[0]*M1) - (float)(colcoeffs[1]*M2) - (float)(colcoeffs[2]*M3) - (float)(colcoeffs[3]*M4);

		   filt[(Tx*numRows*numCols)+((i+c)*numRows)+r] += (float)(colcoeffs[4]*M0) + (float)(colcoeffs[5]*M1) + (float)(colcoeffs[6]*M2) + (float)(colcoeffs[7]*M3);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	__syncthreads();

	 i=0;

	//BACKWARD FILTER PASS ALONG COLUMNS for FILT1	  

	asum = (float)(1.0 + colcoeffs[8] + colcoeffs[9] + colcoeffs[10] + colcoeffs[11]);

	for (c=0; c < numCols; c++) {

		M1 = M2 = M3 = M4 = ((float)(tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-1] / asum));

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1]);

		   M0 = (float)dtmp - (float)(colcoeffs[8]*M1) - (float)(colcoeffs[9]*M2) - (float)(colcoeffs[10]*M3) - (float)(colcoeffs[11]*M4);

		   filt[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1] += (float)(colcoeffs[12]*M1) + (float)(colcoeffs[13]*M2) + (float)(colcoeffs[14]*M3) + (float)(colcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	//if (Tx==0) {

	//  for (j=0;j<numRows*numCols;j=j+10000)

	//	  mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//  mexPrintf("\n**************************************************\n\n");

	//}

__syncthreads();

//=====================================================================

	

	// Zero out tmpimage so the variable can be reused

	for (i=0; i<numRows*numCols; i++) {

			tmpimage[(Tx*numRows*numCols)+i] = 0;

	}

	//===================================================================

	

	__syncthreads();

	

	

	//===================================================================

	//Rounds 5 and 6: Calculations for the first half of FILT2

	i=0;	

	//FORWARD FILTER PASS ALONG ROWS for FILT2

	asum = (float)(1.0 + colcoeffs[0] + colcoeffs[1] + colcoeffs[2] + colcoeffs[3]);

	for (r=0; r < numRows; r++) {

		M1 = M2 = M3 = M4 = (((float)(inimage[i+r]<=dgrey)) / asum); 

		for (c=0; c < numCols; c++) {

		   dtmp = ((float)(inimage[(i+r)+c*numRows]<=dgrey));

		   M0 = (float)dtmp - (float)(colcoeffs[0]*M1) - (float)(colcoeffs[1]*M2) - (float)(colcoeffs[2]*M3) - (float)(colcoeffs[3]*M4);

		   tmpimage[(Tx*numRows*numCols)+c*numRows+(i+r)] += (float)(colcoeffs[4]*M0) + (float)(colcoeffs[5]*M1) + (float)(colcoeffs[6]*M2) + (float)(colcoeffs[7]*M3);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG ROWS for FILT2

	asum = (float)(1.0 + colcoeffs[8] + colcoeffs[9] + colcoeffs[10] + colcoeffs[11]);

	for (r=0; r < numRows; r++) {

		M1 = M2 = M3 = M4 = (((float)(inimage[numRows*numCols-(i+r)-1]<=dgrey)) / asum); 

		for (c=0; c < numCols; c++) {

		   dtmp = ((float)(inimage[numRows*numCols-(i+r)-1-c*numRows]<=dgrey));

		   M0 = (float)dtmp - (float)(colcoeffs[8]*M1) - (float)(colcoeffs[9]*M2) - (float)(colcoeffs[10]*M3) - (float)(colcoeffs[11]*M4);

		   tmpimage[((Tx+1)*numRows*numCols)-(i+r)-1-(c*numRows)] += (float)(colcoeffs[12]*M1) + (float)(colcoeffs[13]*M2) + (float)(colcoeffs[14]*M3) + (float)(colcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}	 

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//	   mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	//===================================================================

	//Rounds 7 and 8: Calculations for the second half of FILT2

	__syncthreads();

	i=0;

	//FORWARD FILTER PASS ALONG COLUMNS for FILT2

	asum = (float)(1.0 + rowcoeffs[0] + rowcoeffs[1] + rowcoeffs[2] + rowcoeffs[3]);

	for (c=0; c < numCols; c++) {

		M1 = M2 = M3 = M4 = ((float)(tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)] / asum));

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)+r]);

		   M0 = (float)dtmp - (float)(rowcoeffs[0]*M1) - (float)(rowcoeffs[1]*M2) - (float)(rowcoeffs[2]*M3) - (float)(rowcoeffs[3]*M4);

		   filt[(Tx*numRows*numCols)+((i+c)*numRows)+r] += (float)(rowcoeffs[4]*M0) + (float)(rowcoeffs[5]*M1) + (float)(rowcoeffs[6]*M2) + (float)(rowcoeffs[7]*M3);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	//if (Tx==0) {

	// for (j=0;j<numRows*numCols;j=j+10000)

	//   	mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	// mexPrintf("\n**************************************************\n\n");

	//}

	

	__syncthreads();

		i=0;

	//BACKWARD FILTER PASS ALONG COLUMNS for FILT2   

	asum = (float)(1.0 + rowcoeffs[8] + rowcoeffs[9] + rowcoeffs[10] + rowcoeffs[11]); 

	for (c=0; c < numCols; c++) {

		M1 = M2 = M3 = M4 = ((float)(tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-1] / asum));

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1]);

		   M0 = (float)dtmp - (float)(rowcoeffs[8]*M1) - (float)(rowcoeffs[9]*M2) - (float)(rowcoeffs[10]*M3) - (float)(rowcoeffs[11]*M4);

		   filt[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1] += (float)(rowcoeffs[12]*M1) + (float)(rowcoeffs[13]*M2) + (float)(rowcoeffs[14]*M3) + (float)(rowcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	} 

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	//if (Tx==0) {

	   //  for (i=0; i<numRows*numCols; i++) {

	   //	  scoreim[i] = filt[(Tx*numRows*numCols)+i];

	//  }

	//}

	__syncthreads();

	//===================================================================

	//LAPLACIAN IMAGE IS THEN FILT1+FILT2

	//accumulate abs value into output score array

	dtmp = 0.0;

	

	if (Tx==0) {

	// deliberately serialized to avoid memory access issues

		for (j=0; j < numRows*numCols; j++) {

		

			scoreim[j] = 0.0f;

	

			for (i=0; i < BIN_QTY; i++){

				dtmp = (float)(filt[i*numRows*numCols + j]);

				dtmp = (float)((dtmp < 0.0f) ? -dtmp : dtmp);	// absolute value

				scoreim[j] += (float)dtmp;			

			}

			

		}

	}

	

	

	//__syncthreads(); 

	

	

	//if(Tx==0) {

	//	for (i=0;i<numRows*numCols;i=i+10000)

	//		mexPrintf("Pixel[%d] = %.40f\n",i,scoreim[i]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	return;

}   // END OF KERNEL (each thread processes one grey-scale channel)

/* --------------------------------------------- Host's (CPU) Main( ) Code ----------------------------------- */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

	if(!InitCUDA()) {

			return;

		}

		

	//All code and internal function calls go in here!	

	//Declarations

	float *inimage_h, *scoreim_h, *rowcoeffs_h, *colcoeffs_h;

	//float *inimage, *scoreim, *tmpimage, *filt;

	//float *rowcoeffs, *colcoeffs;

	//int numRows, numCols;

	int i, numPix;

	const int *dims;

	

	// Get input image and coefficient arrays

	if (!mxIsSingle(prhs[0])) {

		mexErrMsgTxt("Input image should be single-precision float.\n");

	}

	inimage_h = (float*)mxGetData(prhs[0]);	// mxGetData (not mxGetPr) for single-class arrays

	dims = mxGetDimensions(prhs[0]);

	numRows = dims[0];

	numCols = dims[1];

	numPix = numRows*numCols;

	rowcoeffs_h = (float*)mxGetData(prhs[1]);

	dims = mxGetDimensions(prhs[1]); 

	if (dims[0]*dims[1] != 16)

	   mexErrMsgTxt("dericheIIR2d: rowcoeffs must be a vector of length 16\n");

	colcoeffs_h = (float*)mxGetData(prhs[2]);

	dims = mxGetDimensions(prhs[2]);

	if (dims[0]*dims[1] != 16)

	   mexErrMsgTxt("dericheIIR2d: colcoeffs must be a vector of length 16\n");	

	

	//copy the input image and coefficient arrays onto the GPU (i.e. the device)

	//cudaMalloc((void**)&inimage,sizeof(float)*numPix*BIN_QTY);

	cudaMalloc((void**)&inimage,sizeof(float)*numPix);

	cudaMalloc((void**)&scoreim,sizeof(float)*numPix);

	cudaMalloc((void**)&rowcoeffs,sizeof(float)*16);

	cudaMalloc((void**)&colcoeffs,sizeof(float)*16);

	cudaMalloc((void**)&tmpimage,sizeof(float)*numPix*BIN_QTY);

	cudaMalloc((void**)&filt,sizeof(float)*numPix*BIN_QTY);

	

	

	//cudaMemset(inimage,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemset(inimage,0,sizeof(float)*numPix);

	cudaMemset(scoreim,0,sizeof(float)*numPix);

	cudaMemset(rowcoeffs,0,sizeof(float)*16);

	cudaMemset(colcoeffs,0,sizeof(float)*16);

	cudaMemset(tmpimage,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemset(filt,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemcpy(inimage, inimage_h, sizeof(float)*numPix, cudaMemcpyHostToDevice);

	cudaMemcpy(rowcoeffs, rowcoeffs_h, sizeof(float)*16, cudaMemcpyHostToDevice);

	cudaMemcpy(colcoeffs, colcoeffs_h, sizeof(float)*16, cudaMemcpyHostToDevice);

	

	// Set-up the execution configuration

	dim3 dimGrid(1,1); 			/* the grid dimensions in number of blocks */ 

	dim3 dimBlock(BIN_QTY,1);			/* the block dimensions in number of threads */

	// Launch a kernel of threads to perform some function on the Device

	ProcessFrame_DeviceKernel<<<dimGrid,dimBlock>>>(inimage,scoreim,tmpimage,filt,rowcoeffs,colcoeffs, numRows, numCols);
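
	// (note) a cudaGetLastError()/cudaThreadSynchronize() pair here would surface
	// launch or execution failures; the cudaMemcpy below synchronizes implicitly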

	

	

	//create output image, same size as input image

	plhs[0] = mxCreateNumericMatrix(numRows, numCols, mxSINGLE_CLASS, mxREAL); //mxReal is our data-type

	mxSetM(plhs[0],numRows);

	mxSetN(plhs[0],numCols);

	scoreim_h = (float*)mxGetData(plhs[0]);

	cudaMemcpy(scoreim_h, scoreim, sizeof(float)*numPix, cudaMemcpyDeviceToHost);

	for (i=0;i<numRows*numCols;i=i+10000)

		printf("Pixel[%d] = %.40f\n",i,scoreim_h[i]);

	printf("\n**************************************************\n\n");

	

	cudaFree(inimage);

	cudaFree(scoreim);

	cudaFree(tmpimage);

	cudaFree(filt);

	cudaFree(rowcoeffs);

	cudaFree(colcoeffs);

	return;

}

[/code]