Difference between Device emulation and execution modes

Hi Everyone,

I have been stuck on this problem for a long time, trying everything I can think of, and I will be really thankful if someone could help! I will describe my problem here in points:

  1. I am using a GeForce 9800 GTX+ on a 32-bit Windows XP machine, with C and CUDA 2.0. Half the project is in Matlab and half in C (the GPU kernel is in C, of course) and I am using nvmex functions/conversions. I am using Matlab R2008a and VS2005 (8.0); the underlying compiler is the Visual Studio compiler.

  2. I have a very simple .cu code: 8 convolutions over an image, followed by addition of these 8 convolved images. It is not very compute intensive; just 64 adds and 64+256 multiplies per pixel.

  3. Outputs from GPU emulation and execution do not match. I wouldn’t say they are completely and unacceptably different…but whatever little accuracy I get with GPU execution is just not sufficient for my purpose. So I have been trying a lot of different things to check for the possible error.

  4. I simplified the code to a very great extent and removed all shared memory accesses; everything is loaded and stored in global memory and registers. I have set aside the original goal of speed-up; all I am concentrating on now is accuracy.

  5. I am not using double at all. Also, I explicitly type-cast all intermediate operations/results as floats to try to eliminate the use of extended precision on the CPU (emulation mode). The emulation-mode result has moved away from the original reference, but it is still not close to what GPU execution gives.

  6. I wrote a function to check whether the CPU was handling any denormalized numbers in emulation mode (which could lead to higher accuracy) and, if so, to flush them to zero so the CPU behaves more like the GPU. But I am convinced that the CPU does not encounter any denormalized numbers during the entire run of the application.

  7. I used __fadd_rn and __fmul_rn to decompose FMADs. It reduced the relative error only negligibly. I also tried changing the order of the floating-point additions [(a+b)+(c+d), ((a+b)+c)+d, etc.], but that did not help. I even tried Kahan summation to increase floating-point accuracy (a minimal sketch of what I mean is included after this list), but saw no improvement.

  8. I am sure about the CUDA/GPU initialization; I picked that routine up from one of the sample codes in the SDK and have already tried it with other test codes I had written before, where CUDA error handling never reported any errors. I can't use error handling in this project since nvmex doesn't allow it, but I am very sure that none of the CUDA calls is the issue: I have tested by transferring the intermediate results at every step back to the CPU and printing them from there, and never found any discrepancy in what is exchanged between the CPU and the GPU in execution mode.

  9. I tried with 1block-1thread and also with 1block-32threads kernel configurations to check if there was a race condition or memory access issue…but nothing seems to improve the GPU results.

  10. When I write simple test codes for, say, 10000 floating-point additions in a loop, the results are the same from both execution and emulation. This makes me feel that there is some small thing I am missing. It is the same piece of code compiled separately for emulation and execution.
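For point 7, this is roughly the kind of Kahan (compensated) summation I tried; it is just a minimal sketch with placeholder names, not my actual kernel code:

[code]
// Minimal Kahan (compensated) summation sketch -- placeholder names, not the real kernel.
// Note: this relies on the compiler not algebraically reassociating the operations.
__device__ float kahan_sum(const float *data, int n)
{
	float sum  = 0.0f;   // running total
	float comp = 0.0f;   // running compensation for lost low-order bits
	for (int i = 0; i < n; i++) {
		float y = data[i] - comp;   // corrected next term
		float t = sum + y;          // low-order bits of y may be lost here
		comp = (t - sum) - y;       // recover what was lost
		sum = t;
	}
	return sum;
}
[/code]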

I tried everything I could think of (more than what I have described above) to improve the GPU execution results and bring them close to the CPU emulation results. When that did not help, I tried to remove the discrepancy from the CPU side (killing denormalized numbers, strict single-precision arithmetic, etc.) to see whether the CPU results would degrade and come close to the execution results. Nothing changes things drastically enough to bring the two results within acceptable error limits. I have checked intermediate steps too: the error builds up gradually, but it starts out large in the first place.

I am saturated and don't know what more to try. I am sharing my code (the .cu part) below in case someone wants to see it (I keep changing the constant BIN_QTY to run with different numbers of threads). Please suggest something. I cannot get funds to buy higher-end equipment (say a GTX 280, which supports double precision) until I have shown that it is the device's limitation and not my mistake, and I don't want to be stuck with the same problem even on more expensive equipment.

[code]

/*********************************************************************
  This CUDA program deals with only the convolutions part of CSDD
**********************************************************************/

#include <stdio.h>

#include <stdlib.h>

#include "math.h"

#include "matrix.h"

#include "mex.h"

#include <cuda.h>

#include "cuda_runtime.h"

#include "cuda_runtime_api.h" 

#include "cuda.h"

#include "math_functions.h"

#define BIN_QTY 32

/*************************************************************************/
/* Init CUDA                                                             */
/*************************************************************************/

#if __DEVICE_EMULATION__

bool InitCUDA(void){return true;}

#else

bool InitCUDA(void)

{

	int count = 0;

	int i = 0;

	cudaGetDeviceCount(&count);

	if(count == 0) {

		fprintf(stderr, "There is no device.\n");

		return false;

	}

	for(i = 0; i < count; i++) {

		cudaDeviceProp prop;

		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {

			if(prop.major >= 1) {

				break;

			}

		}

	}

	if(i == count) {

		fprintf(stderr, "There is no device supporting CUDA.\n");

		return false;

	}

	cudaSetDevice(i);

	printf("CUDA initialized.\n");

	return true;

}

#endif

/****************************************************************************/
/* The Code Begins - Equivalent of dericheCSDEMD(inimage,gaucoeffs,gau2coeffs) */
/****************************************************************************/

__device__ float *inimage, *scoreim, *tmpimage, *filt;

__device__ float *rowcoeffs, *colcoeffs;

__device__ int numRows, numCols;

extern "C"

__global__ static void ProcessFrame_DeviceKernel(float* inimage, float* scoreim, float* tmpimage, float* filt, float* rowcoeffs, float* colcoeffs, int numRows, int numCols)

{

	int i,Tx;

		

	// 1D Thread ID

	Tx = threadIdx.x;

	

	__syncthreads();

//=====================================================================

	//STEP THROUGH 128 GREY(float)SCALE CHANNELS 0-1, 2-3, ..., 254-255 AND DO

	//LAPLACIAN OF GAUSSIAN CSD FILTERING, ACCUMULATING RESULT INTO SCOREIM

	int grey = 2*Tx+1;

	float dgrey = (float)((float)grey/(float)256.0);

	int j,r,c;

	float asum, dtmp;

	float M0,M1,M2,M3,M4;			  //memory registers

	//=====================================================================

	//Rounds 1 and 2: Calculations for the first half of FILT1

	__syncthreads();

	i=0;	

	//FORWARD FILTER PASS ALONG ROWS for FILT1	  

	asum = (float)(1.0 + rowcoeffs[0] + rowcoeffs[1] + rowcoeffs[2] + rowcoeffs[3]);

	for (r=0; r < numRows; r++) {

	M1 = M2 = M3 = M4 = (((float)(inimage[i+r]<=dgrey)) / asum);

		for (c=0; c < numCols; c++) {

			dtmp = ((float)(inimage[(i+r)+(c*numRows)]<=dgrey));

			M0 =  (float)dtmp - (float)(rowcoeffs[0]*M1) - (float)(rowcoeffs[1]*M2) - (float)(rowcoeffs[2]*M3) - (float)(rowcoeffs[3]*M4); 

			tmpimage[(Tx*numRows*numCols)+(i+r)+(c*numRows)] += (float)(rowcoeffs[4]*M0) + (float)(rowcoeffs[5]*M1) + (float)(rowcoeffs[6]*M2) + (float)(rowcoeffs[7]*M3);

			M4 = M3; M3 = M2; M2 = M1; M1 = M0;

		}	

	}																																														

	//if (Tx==0) {

	//  for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//  mexPrintf("\n**************************************************\n\n");

	//}

	

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG ROWS for FILT1

	asum = (float)(1.0 + rowcoeffs[8] + rowcoeffs[9] + rowcoeffs[10] + rowcoeffs[11]);

	for (r=0; r < numRows; r++) { 

	M1 = M2 = M3 = M4 = (((float)(inimage[numRows*numCols-(i+r)-1]<=dgrey)) / asum); 

		for (c=0; c < numCols; c++) {

			dtmp = ((float)(inimage[(numRows*numCols)-(i+r)-1-(c*numRows)]<=dgrey));

			M0 =  (float)dtmp - (float)(rowcoeffs[8]*M1) - (float)(rowcoeffs[9]*M2) - (float)(rowcoeffs[10]*M3) - (float)(rowcoeffs[11]*M4);

			tmpimage[((Tx+1)*numRows*numCols)- (i+r)-1-(c*numRows)] += (float)(rowcoeffs[12]*M1) + (float)(rowcoeffs[13]*M2) + (float)(rowcoeffs[14]*M3) + (float)(rowcoeffs[15]*M4);

			M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

	   }

	}	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	//=====================================================================

	//Rounds 3 and 4: Calculations for the second half of FILT1

	

	i=0;

	__syncthreads();

	//FORWARD FILTER PASS ALONG COLUMNS for FILT1

	asum = (float)(1.0 + colcoeffs[0] + colcoeffs[1] + colcoeffs[2] + colcoeffs[3]);

	for (c=0; c < numCols; c++) {

	M1 = M2 = M3 = M4 = (float(tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)] / asum)); 

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)+r]);

		   M0 = (float)dtmp - (float)(colcoeffs[0]*M1) - (float)(colcoeffs[1]*M2) - (float)(colcoeffs[2]*M3) - (float)(colcoeffs[3]*M4);

		   filt[(Tx*numRows*numCols)+((i+c)*numRows)+r] += (float)(colcoeffs[4]*M0) + (float)(colcoeffs[5]*M1) + (float)(colcoeffs[6]*M2) + (float)(colcoeffs[7]*M3); 

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG COLUMNS for FILT1	  

	asum = (float)(1.0 + colcoeffs[8] + colcoeffs[9] + colcoeffs[10] + colcoeffs[11]);

	for (c=0; c < numCols; c++) {

	M1 = M2 = M3 = M4 = ((float)(tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-1] / asum)); 

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1]);

		   M0 = (float)dtmp - (float)(colcoeffs[8]*M1) - (float)(colcoeffs[9]*M2) - (float)(colcoeffs[10]*M3) - (float)(colcoeffs[11]*M4);

		   filt[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1] += (float)(colcoeffs[12]*M1) + (float)(colcoeffs[13]*M2) + (float)(colcoeffs[14]*M3) + (float)(colcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	//if (Tx==0) {

	//  for (j=0;j<numRows*numCols;j=j+10000)

	//	  mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//  mexPrintf("\n**************************************************\n\n");

	//}

__syncthreads();

//=====================================================================

	

	//For reuse of variable tmpimage

	for (i=0; i<numRows*numCols; i++) {

			tmpimage[(Tx*numRows*numCols)+i] = 0;

	}

	//=====================================================================

	

	__syncthreads();

	

	

	//=====================================================================

	//Rounds 5 and 6: Calculations for the first half of FILT2

	i=0;	

	//FORWARD FILTER PASS ALONG ROWS for FILT2

	asum = (float)(1.0 + colcoeffs[0] + colcoeffs[1] + colcoeffs[2] + colcoeffs[3]);

	for (r=0; r < numRows; r++) {

		M1 = M2 = M3 = M4 = (((float)(inimage[i+r]<=dgrey)) / asum); 

		for (c=0; c < numCols; c++) {

		   dtmp = ((float)(inimage[(i+r)+c*numRows]<=dgrey));

		   M0 = (float)dtmp - (float)(colcoeffs[0]*M1) - (float)(colcoeffs[1]*M2) - (float)(colcoeffs[2]*M3) - (float)(colcoeffs[3]*M4);		   

		   tmpimage[(Tx*numRows*numCols)+c*numRows+(i+r)] += (float)(colcoeffs[4]*M0) + (float)(colcoeffs[5]*M1) + (float)(colcoeffs[6]*M2) + (float)(colcoeffs[7]*M3); 

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG ROWS for FILT2

	asum = (float)(1.0 + colcoeffs[8] + colcoeffs[9] + colcoeffs[10] + colcoeffs[11]);

	for (r=0; r < numRows; r++) {

		M1 = M2 = M3 = M4 = (((float)(inimage[numRows*numCols-(i+r)-1]<=dgrey)) / asum); 

		for (c=0; c < numCols; c++) {

		   dtmp = ((float)(inimage[numRows*numCols-(i+r)-1-c*numRows]<=dgrey));

		   M0 = (float)dtmp - (float)(colcoeffs[8]*M1) - (float)(colcoeffs[9]*M2) - (float)(colcoeffs[10]*M3) - (float)(colcoeffs[11]*M4);

		   tmpimage[((Tx+1)*numRows*numCols)-(i+r)-1-(c*numRows)] += (float)(colcoeffs[12]*M1) + (float)(colcoeffs[13]*M2) + (float)(colcoeffs[14]*M3) + (float)(colcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}	 

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//	   mexPrintf("Pixel[%d] = %.40f.\n",j,tmpimage[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	//=====================================================================

	//Rounds 7 and 8: Calculations for the second half of FILT2

	__syncthreads();

	i=0;

	//FORWARD FILTER PASS ALONG COLUMNS for FILT2

	asum = (float)(1.0 + rowcoeffs[0] + rowcoeffs[1] + rowcoeffs[2] + rowcoeffs[3]);

	for (c=0; c < numCols; c++) {

	M1 = M2 = M3 = M4 = ((float)(tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)] / asum)); 

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[(Tx*numRows*numCols)+((i+c)*numRows)+r]);

		   M0 = (float)dtmp - (float)(rowcoeffs[0]*M1) - (float)(rowcoeffs[1]*M2) - (float)(rowcoeffs[2]*M3) - (float)(rowcoeffs[3]*M4);

		   filt[(Tx*numRows*numCols)+((i+c)*numRows)+r] += (float)(rowcoeffs[4]*M0) + (float)(rowcoeffs[5]*M1) + (float)(rowcoeffs[6]*M2) + (float)(rowcoeffs[7]*M3);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	}

	//if (Tx==0) {

	// for (j=0;j<numRows*numCols;j=j+10000)

	//   	mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	// mexPrintf("\n**************************************************\n\n");

	//}

	

	__syncthreads();

	i=0;

	//BACKWARD FILTER PASS ALONG COLUMNS for FILT2   

	asum = (float)(1.0 + rowcoeffs[8] + rowcoeffs[9] + rowcoeffs[10] + rowcoeffs[11]); 

	for (c=0; c < numCols; c++) {

	M1 = M2 = M3 = M4 = ((float)(tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-1] / asum)); 

		for (r=0; r < numRows; r++) {

		   dtmp = ((float)tmpimage[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1]);

		   M0 = (float)dtmp - (float)(rowcoeffs[8]*M1) - (float)(rowcoeffs[9]*M2) - (float)(rowcoeffs[10]*M3) - (float)(rowcoeffs[11]*M4);

		   filt[((Tx+1)*numRows*numCols)-((i+c)*numRows)-r-1] += (float)(rowcoeffs[12]*M1) + (float)(rowcoeffs[13]*M2) + (float)(rowcoeffs[14]*M3) + (float)(rowcoeffs[15]*M4);

		   M4 = M3; M3 = M2; M2 = M1; M1 = M0;	

		}

	} 

	

	//if (Tx==0) {

	//	for (j=0;j<numRows*numCols;j=j+10000)

	//		mexPrintf("Pixel[%d] = %.40f.\n",j,filt[(Tx*numRows*numCols)+j]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	//if (Tx==0) {

	//  for (i=0; i<numRows*numCols; i++) {

	//	  scoreim[i] = filt[(Tx*numRows*numCols)+i];

	//  }

	//}

	__syncthreads();

	//=====================================================================

	//LAPLACIAN IMAGE IS THEN FILT1+FILT2

	//accumulate abs value into output score array

	dtmp = 0.0;

	

	if (Tx==0) {

	// deliberately serialized to avoid memory access issues

		for (j=0; j < numRows*numCols; j++) {

		

			scoreim[j] = 0.0f;

	

			for (i=0; i < BIN_QTY; i++){

				dtmp = (float)(filt[i*numRows*numCols + j]);

				dtmp = float((dtmp < 0.0) ? -(dtmp) : dtmp);

				scoreim[j] += (float)dtmp;			

			}

			

		}

	}

	

	

	//__syncthreads(); 

	

	

	//if(Tx==0) {

	//	for (i=0;i<numRows*numCols;i=i+10000)

	//		mexPrintf("Pixel[%d] = %.40f\n",i,scoreim[i]);

	//	mexPrintf("\n**************************************************\n\n");

	//}

	

	return;

}   //END OF GREY(float)SCALE CHANNEL FOR LOOP

/* --------------------------------------------- Host's (CPU) Main( ) Code ----------------------------------- */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

	if(!InitCUDA()) {

			return;

		}

		

	//All code and internal function calls go in here!	

	//Declarations

	float *inimage_h, *scoreim_h, *rowcoeffs_h, *colcoeffs_h;

	//float *inimage, *scoreim, *tmpimage, *filt;

	//float *rowcoeffs, *colcoeffs;

	//int numRows, numCols;

	int i, numPix;

	const int *dims;

	

	// Get input image and co-efficient arrays 

	if (!mxIsSingle(prhs[0])) {

		mexErrMsgTxt("Input image should be single-precision float.\n");

	}

	inimage_h = (float*)mxGetPr(prhs[0]);

	dims = mxGetDimensions(prhs[0]);

	numRows = dims[0];

	numCols = dims[1];

	numPix = numRows*numCols;

	rowcoeffs_h = (float*)mxGetPr(prhs[1]); 

	dims = mxGetDimensions(prhs[1]); 

	if (dims[0]*dims[1] != 16)

	   mexErrMsgTxt("dericheIIR2d: rowcoeffs must be a vector of length 16\n");

	colcoeffs_h = (float*)mxGetPr(prhs[2]);

	dims = mxGetDimensions(prhs[2]);

	if (dims[0]*dims[1] != 16)

	   mexErrMsgTxt("dericheIIR2d: colcoeffs must be a vector of length 16\n");	

	

	//copy the input image and co-efficient arrays onto the GPU (i.e. device)

	//cudaMalloc((void**)&inimage,sizeof(float)*numPix*BIN_QTY);

	cudaMalloc((void**)&inimage,sizeof(float)*numPix);

	cudaMalloc((void**)&scoreim,sizeof(float)*numPix);

	cudaMalloc((void**)&rowcoeffs,sizeof(float)*16);

	cudaMalloc((void**)&colcoeffs,sizeof(float)*16);

	cudaMalloc((void**)&tmpimage,sizeof(float)*numPix*BIN_QTY);

	cudaMalloc((void**)&filt,sizeof(float)*numPix*BIN_QTY);

	

	

	//cudaMemset(inimage,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemset(inimage,0,sizeof(float)*numPix);

	cudaMemset(scoreim,0,sizeof(float)*numPix);

	cudaMemset(rowcoeffs,0,sizeof(float)*16);

	cudaMemset(colcoeffs,0,sizeof(float)*16);

	cudaMemset(tmpimage,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemset(filt,0,sizeof(float)*numPix*BIN_QTY);

	cudaMemcpy(inimage, inimage_h, sizeof(float)*numPix, cudaMemcpyHostToDevice);

	cudaMemcpy(rowcoeffs, rowcoeffs_h, sizeof(float)*16, cudaMemcpyHostToDevice);

	cudaMemcpy(colcoeffs, colcoeffs_h, sizeof(float)*16, cudaMemcpyHostToDevice);

	

	// Set-up the execution configuration

	dim3 dimGrid(1,1); 					/* the grid dimensions in number of blocks */ 

	dim3 dimBlock(BIN_QTY,1);			/* the block dimensions in number of threads */

	// Launch a kernel of threads to perform some function on the Device

	ProcessFrame_DeviceKernel<<<dimGrid,dimBlock>>>(inimage,scoreim,tmpimage,filt,rowcoeffs,colcoeffs, numRows, numCols);

	

	

	//create output image, same size as input image

	plhs[0] = mxCreateNumericMatrix(numRows, numCols, mxSINGLE_CLASS, mxREAL); //mxReal is our data-type

	mxSetM(plhs[0],numRows);

	mxSetN(plhs[0],numCols);

	scoreim_h = (float*)mxGetPr(plhs[0]); 

	cudaMemcpy(scoreim_h, scoreim, sizeof(float)*numPix, cudaMemcpyDeviceToHost);

	for (i=0;i<numRows*numCols;i=i+10000)

		printf("Pixel[%d] = %.40f\n",i,scoreim_h[i]);

	printf("\n**************************************************\n\n");

	

	cudaFree(inimage);

	cudaFree(scoreim);

	cudaFree(tmpimage);

	cudaFree(filt);

	cudaFree(rowcoeffs);

	cudaFree(colcoeffs);

	return;

}

[/code]

PS: Sorry to post this again; the last post somehow did not give me an option to edit it afterwards.

Out of curiosity, can you quantify this? What is the max and average difference between CPU and GPU results?

It is really quite difficult to force the CPU to use 32-bit precision for floats. I don’t think type-casting intermediate operations is sufficient. With gcc, you have to use the -ffloat-store option to force the intermediate values to be written out to memory (and therefore discard the extended precision bits in the registers). I’m not sure what the equivalent option is for Visual Studio, though.
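One thing that might be worth trying with the Microsoft compiler (I haven't verified this myself, so treat it as a sketch) is dropping the x87 FPU to 24-bit precision with _controlfp_s before running the emulation-mode code. It narrows intermediates kept in FPU registers, but it is not a complete substitute for true 32-bit arithmetic and does not affect SSE code paths:

[code]
/* Sketch only: force the x87 FPU to 24-bit (single) precision on 32-bit x86 Windows. */
#include <float.h>

void use_single_precision_fpu(void)
{
	unsigned int current;
	_controlfp_s(&current, _PC_24, _MCW_PC);   /* set precision control to 24 bits */
}
[/code]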

The average relative error is ~30%. At the end of the code the accumulated error starts at the fourth or fifth decimal place, but initially the error starts at the sixth decimal place. The magnitudes of the numbers being dealt with are pretty small, but scaling them up doesn't seem to help much either.

You are right, typecasting doesn't completely ensure it. Different compiler options have been suggested in the programming guide too, and I roughly know how to use them with Visual Studio. But my problem is that with nvmex I use Matlab to create and manage the project, so there is no VS solution or project as such, and I do not know how to apply those options. Also, the output after typecasting was not promising enough for me to pursue this direction further.

I would be grateful if someone could share similar experiences, error figures, solutions, etc.

inimage_h = (float*)mxGetPr(prhs[0]);

This does not convert your double input to a float input; it just typecasts the pointer. Check the Matlab plugin code for examples of how to convert your Matlab inputs to floats (and back). Code snippet below:

void convert_double2float( double *input_double, float *output_float,int Ntot)

{

	int i;

	for (i = 0; i < Ntot; i++)

	{

				output_float[i] = (float) input_double[i];

	}

}

void convert_float2double( float *input_float, double *output_double,int Ntot)

{

	int i;

	for (i = 0; i < Ntot; i++)

	{

				output_double[i] = (double) input_float[i];

	}

}

Also, (float)256.0 is better written as 256.0f. The first form might cause a double value to be cast to float on devices that support doubles.

Thanks for your reply. Here are my observations on this:

  1. The moment I use a double variable in my .cu file and run it, either the result is garbage or my Matlab session crashes, even in device emulation mode; maybe this is because my device doesn't support double.

  2. I convert the input to single using the cast(…,'single') function in Matlab and then pass it on to the .cu code; I re-cast(…,'double') the output back to double in Matlab. (See the sketch after this list for how I now understand the safe way to handle this.) If I instead do:

for (i=0;i<N;i++)

my_float[i] = (float)my_double[i];

the matlab session crashes.

  3. I referred to http://www.cs.utah.edu/~wkjeong/publicatio…_cuda_final.pdf and http://developer.download.nvidia.com/compu…with%20CUDA.pdf. I have followed their technique for the input and the output in the .m file, because I cannot do it in the .cu file, as explained above. I cast the input to 'single' as the last step in MATLAB before entering the .cu code. I do not know exactly what is meant by "the result is transformed back to double precision before it is returned to MATLAB"; I cast the result to double as the first step after re-entering MATLAB, and it is only processed in MATLAB thereafter.

  4. I have checked the values of the input in both Matlab (after casting to single) and the .cu code (after the nvmex functions that take the input as floats), and they are identical. So I doubt that this conversion is what is affecting the accuracy.
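Here is the sketch I mentioned in point 2 of how I now understand the safe handling; the function and variable names are just illustrative, not from my actual code:

[code]
/* Illustrative sketch only -- names (get_float_input, scratch) are placeholders.
   The point: if the mxArray is already single, its data is stored as floats, so
   converting it "again" from double reads the wrong representation, which would
   explain the crash. */
#include "mex.h"

float *get_float_input(const mxArray *in, float *scratch, int n)
{
	int i;
	if (mxIsSingle(in)) {
		return (float*)mxGetData(in);        /* already 32-bit floats */
	} else if (mxIsDouble(in)) {
		double *src = mxGetPr(in);           /* 64-bit doubles: convert */
		for (i = 0; i < n; i++)
			scratch[i] = (float)src[i];
		return scratch;
	}
	mexErrMsgTxt("Input must be single or double.\n");
	return 0;                                /* not reached */
}
[/code]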

Thanks for this input…I have tried a lot of trial-and-error to figure this out but was never sure.

Thanks,

Aditi

I have one more observation to share. I tried running some of the sample codes in the SDK in device emulation and execution modes. Most of them only compare speed-ups, but in one or two (e.g. BlackScholes) there was a difference in the L1 and L2 errors/values between emulation and execution. Does this indicate something about the accuracy of the underlying GPU hardware? Has someone tried anything more with the SDK in this direction?

Ahh, missed that part. Since I always do the conversion in C (I make too many mistakes and am bound to forget to cast in Matlab at some point ;)), I missed the conversion in your code.

There is a whole appendix in the programming guide about the precision of each operation on GPU hardware. There can certainly be differences caused by that too. But it looks like it is mostly a different order of operations that is causing the differences.
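Just to illustrate the order-of-operations point with a tiny host-side example (nothing to do with your kernel specifically):

[code]
/* Tiny demonstration that float addition is not associative:
   summing the same three values in a different order gives different results. */
#include <stdio.h>

int main(void)
{
	float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
	float left  = (a + b) + c;   /* = 1.0f */
	float right = a + (b + c);   /* = 0.0f: c is absorbed into b before a cancels */
	printf("left = %g, right = %g\n", left, right);
	return 0;
}
[/code]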

Personally I am tracking down a rather large difference between CUDA and Matlab. It seems a number very close to pi is subtracted from pi, followed by another function (which is quite steep around zero), so a small absolute error (1e-9 to 1e-10 for the number above) results in a large difference in the output (0.9 versus 0.7).

Fortunately I have double-precision hardware; only, my attempts at converting the code in question to double precision lead to NaNs at this time ;)…

I did one more test. I tried using the CUBLAS library in another of my similarly compute-intensive codes, the idea being that it would exercise the underlying hardware without the nuances of my own CUDA coding, which could possibly be what introduces the difference between emulation and execution modes. Surprisingly, with the CUBLAS library the GPU execution and CPU results match, but the GPU takes more time than the CPU (maybe because there is not a lot of computation between the two CUBLAS function calls; it is just a dot product of two 2200-element vectors, so mostly overhead). Can I conclude something from this? Should I assume that there is a problem with my coding in the first code I discussed earlier?
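For completeness, the CUBLAS test was essentially the following; this is a rough sketch with made-up values and placeholder names, not the exact code:

[code]
/* Rough sketch of the CUBLAS dot-product test (placeholder names/values):
   two ~2200-element vectors, dot product computed on the GPU. */
#include <stdio.h>
#include "cublas.h"

#define N 2200

int main(void)
{
	float h_x[N], h_y[N], result;
	float *d_x = 0, *d_y = 0;
	int i;

	for (i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

	cublasInit();
	cublasAlloc(N, sizeof(float), (void**)&d_x);
	cublasAlloc(N, sizeof(float), (void**)&d_y);
	cublasSetVector(N, sizeof(float), h_x, 1, d_x, 1);
	cublasSetVector(N, sizeof(float), h_y, 1, d_y, 1);

	result = cublasSdot(N, d_x, 1, d_y, 1);   /* this matched the CPU result in my test */

	printf("dot = %f\n", result);
	cublasFree(d_x);
	cublasFree(d_y);
	cublasShutdown();
	return 0;
}
[/code]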

Please reply, someone… I have been stuck on this issue for too long :(

Thanks,

Aditi

Hi All,

I have now solved this problem by moving my code to a GTX 280, which supports double precision, and using double-precision variables. So my conclusion was right that it was due to the hardware and not an error in my CUDA code. I now get not only a fair amount of accuracy but also a 6x speed-up with my basic, unoptimized code. My thanks to all those who replied and tried to help.
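One note for anyone trying the same fix: as far as I understand, the kernel also has to be compiled for compute capability 1.3 (e.g. by passing -arch sm_13 to nvcc; how exactly to route that through nvmex may differ), otherwise doubles get demoted back to single precision. A trivial sketch of a kernel that actually uses 64-bit arithmetic on the GTX 280:

[code]
// Sketch: double precision requires building for compute capability 1.3 (e.g. -arch sm_13).
// Without that, the CUDA 2.x compiler demotes double to float (and warns about it).
__global__ void add_doubles(const double *a, const double *b, double *c, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		c[i] = a[i] + b[i];   // true 64-bit add on sm_13 hardware
}
[/code]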

Thanks,

Aditi