INACCURACY OF FLOAT DATA TYPE: FLOAT DATA TYPE BECOMES INACCURATE AT ABOUT 2^15

Hi,

I am learning CUDA.

I’ve used the float data type for the arrays.
When running my scan program, which uses multiple blocks to handle arrays of any size,
I noticed that the results are inaccurate in the last digit when the array size is 8192.
But when the array size is 131072, I was surprised to see that the results are inaccurate in the last 4 digits. :angry:
When I used the double data type in Visual Studio (emulation mode), the results were accurate.
But when building for NVIDIA’s “GeForce 9600 GT”, the compiler emits: “warning : Double is not supported. Demoting to float”.

Any suggestions for getting accurate results?

You may need to learn a few things about floats. But please excuse me if I mention something that is already obvious to you.

a + b and b + a usually give different results (float operations are non-associative), so when you compute something in parallel using a parallel reduction or a scan, the results naturally differ from doing the same in a sequential loop.

EDIT: correction: (a+b)+c and a+(b+c) usually give different results. a + b and b + a should give identical results though.
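To make the corrected statement concrete, here is a minimal sketch in plain C. The function names are made up for illustration, and it assumes strict single-precision evaluation (no `-ffast-math` or similar reassociation by the compiler).

```c
/* Sums three floats left to right: (a + b) + c. */
float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* Sums the same three floats right to left: a + (b + c). */
float sum_right(float a, float b, float c)
{
    return a + (b + c);
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, `sum_left` gives 1 but `sum_right` gives 0: near 1e8 neighbouring floats are 8 apart, so b + c rounds back to -1e8 before a is added. A parallel scan or reduction groups the additions differently from a sequential loop, which is the same effect.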

There are tricks like Kahan summation to reduce the errors when taking sums of many elements; you may want to look into this.
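For reference, a sequential Kahan summation could look roughly like the sketch below. It is plain host-side C, not a drop-in for a parallel scan, the function names are my own, and it assumes the compiler is not allowed to reassociate float arithmetic (fast-math options would optimize the compensation away).

```c
#include <stddef.h>

/* Plain left-to-right float summation, for comparison. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Kahan (compensated) summation: c carries the rounding error
   of the previous addition and feeds it back into the next one. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation */
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;       /* apply the correction to the input */
        float t = sum + y;        /* low-order bits of y may be lost here */
        c = (t - sum) - y;        /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

For the input {1e8, 1, 1, 1, 1, 1, 1, 1, 1}, `naive_sum` stays at 100000000 because every single 1 is rounded away, while `kahan_sum` returns 100000008.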

Generally when adding values of small magnitude to values of large magnitude, you will incur a significant loss of precision. When you have large arrays and you add the block results of the scan to the total, at some point the intermediate result becomes large, and the individual block results are relatively small. This is where your large error may come from.
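A tiny sketch of that effect in plain C (the helper name `is_absorbed` is hypothetical, and strict single-precision evaluation is assumed): once the running total is large enough, adding a small value does not change it at all.

```c
/* Returns 1 if adding `small` to `total` in float leaves `total`
   unchanged, i.e. the small value is lost completely. */
int is_absorbed(float total, float small)
{
    float result = total + small;
    return result == total;
}
```

For example, `is_absorbed(16777216.0f, 1.0f)` is 1, because above 2^24 consecutive floats are 2 apart, while `is_absorbed(8388608.0f, 1.0f)` is 0, because at 2^23 the spacing is still 1.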

You will need to get Compute 2.0 devices in order to do double precision on the GPU (GTX 260 or better). On compute 1.x devices you could try emulating double precision with two floats (also known as double single precision). There is some code on these forums to do just that. It’s quite slow however.
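This is not the forum code mentioned above, just a minimal sketch of the idea in plain C: a value is kept as an unevaluated sum hi + lo of two floats, and the rounding error of each addition is captured with Knuth's two-sum. The type `dsfloat` and all function names are made up for illustration, only addition of a plain float is shown, and the compiler must not reassociate float arithmetic.

```c
/* A "double single" value: hi holds the leading part, lo the remainder. */
typedef struct {
    float hi;
    float lo;
} dsfloat;

/* Knuth's two-sum: returns s = fl(a + b) in hi and the exact
   rounding error of that addition in lo. */
dsfloat two_sum(float a, float b)
{
    dsfloat r;
    float s = a + b;
    float bb = s - a;
    r.hi = s;
    r.lo = (a - (s - bb)) + (b - bb);
    return r;
}

/* Adds a plain float y to the double-single value x. */
dsfloat ds_add(dsfloat x, float y)
{
    dsfloat s = two_sum(x.hi, y);   /* add to the leading part */
    float lo = s.lo + x.lo;         /* accumulate the remainders */
    return two_sum(s.hi, lo);       /* renormalize into hi/lo */
}

/* Converts to double for inspection on the host. */
double ds_to_double(dsfloat x)
{
    return (double)x.hi + (double)x.lo;
}
```

Starting from {16777216.0f, 0.0f} and adding 1.0f three times yields 16777219 with `ds_add`, whereas a single float would stay stuck at 16777216. Each addition costs several float operations, which is where the slowdown comes from.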

And finally, float is only precise to some 7 decimal digits, so when you have 5 significant digits before the decimal point, you can expect only 2 digits after the decimal point to be accurate.
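If you want to see the resolution of float at a given magnitude, a sketch like this one works on the host. It assumes IEEE 754 single precision and a positive, finite input; `float_spacing` is a made-up helper name.

```c
#include <stdint.h>
#include <string.h>

/* Returns the gap between x and the next larger float.
   Assumes x is positive and finite, and that float is IEEE 754 binary32. */
float float_spacing(float x)
{
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float as an integer */
    bits += 1;                        /* next representable bit pattern */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

`float_spacing(12345.0f)` is 0.0009765625, so with 5 digits before the decimal point only the first two or three digits after it can be trusted. `float_spacing(16777216.0f)` is 2, meaning that from 2^24 upwards not even every integer is representable.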

Thanks for explaining floats to me :rolleyes:

But I want to know whether there is any function or method to get accurate results using float (I can afford a performance hit)?

Thanks for explaining floats to me :rolleyes:

But I want to know whether there is any function or method to get accurate results using float (I can afford a performance hit)?

Thanks for explaining floats to me :rolleyes:

But I want to know whether there is any function or method to get accurate results using float (I can afford a performance hit)?

Uuuh, nice triple posting!

Let’s start here ;)

http://en.wikipedia.org/wiki/Kahan_summation_algorithm

Are you using the SDK’s scan sample? What kind of input values do you use? Random numbers?

That was a mistake, thanks to the net connection here :angry:

No, I’m using my own algorithm.

I can provide my kernel if you are interested :rolleyes:

I’m using sequential numbers, i.e. 1, 2, 3, 4, …
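For this particular input the effect can be checked on the host. The sketch below (plain C, sequential order, hypothetical helper name) finds the first prefix at which a float running sum of 1, 2, 3, … stops matching the exact integer sum. A parallel scan adds in a different order, so its errors will not match a sequential loop digit for digit.

```c
/* Returns the first k in 1..n at which the float running sum of
   1, 2, ..., k differs from the exact integer sum, or 0 if the
   float sum stays exact all the way up to n. */
int first_inexact_prefix(int n)
{
    float fsum = 0.0f;
    long long exact = 0;
    for (int k = 1; k <= n; ++k) {
        fsum += (float)k;
        exact += k;
        if ((long long)fsum != exact)
            return k;
    }
    return 0;
}
```

`first_inexact_prefix(8192)` returns 5793: that is where the running sum first exceeds 2^24 = 16777216 and lands on an odd number that float cannot represent. The total for 8192 elements is 33,558,528, where neighbouring floats are 4 apart, and the total for 131072 elements is 8,590,000,128, where they are 1024 apart, which fits errors in the last digit and in the last 3 to 4 digits respectively.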

Sure, post some code and we can talk.

Looking for issues in someone else’s code, that’s partly how I learned CUDA. ;)

I would like to add that double precision only requires CC 1.3 devices (e.g. the GT 260 as stated). But these, and also the CC 2.0 consumer (non-Tesla) cards, will need up to 8 times the single-precision execution time for DP calculations (in the worst case; I more often see 2 to 3 times). With CC 2.0 Tesla cards, DP runs at half the single-precision speed at worst.

Edit: Misread something and corrected my post.

Thanks for the correction. The card is named GTX 260 however.

In contrast the GT 240 only offers CC 1.2 and has no double support. I still like it for its energy efficiency though.

a + b == b + a is commutativity, not associativity (and floating point does commute, to my knowledge)

I think you meant: (a + b) + c is not usually equal to a + (b + c)

Yes, I guess you’re right. It’s a hot summer and thus my statements don’t always make sense ;)

Hey, sorry for the delay.

Here’s the code:

[codebox]// Pass 1: per-block up-sweep (reduce) over one tile of TILE_WIDTH elements.
// Each thread handles two elements; the tile total goes to blockSums[bx].
__global__ void scan(double *a,int n,double *blockSums,int no_of_blocks)
{
    __shared__ double temp[TILE_WIDTH];

    int tx=threadIdx.x,bx=blockIdx.x;
    int idx=bx*TILE_WIDTH+(tx<<1),offset;

    // Load two elements per thread, padding with zeros past the end.
    temp[tx<<1]=(idx<n)?a[idx]:0;
    temp[(tx<<1)+1]=(idx+1<n)?a[idx+1]:0;

    // Up-sweep: build partial sums in place.
    for(offset=2;offset<=TILE_WIDTH;offset<<=1)
    {
        int i=offset*(tx+1)-1;
        int j=i-(offset>>1);
        if(i<TILE_WIDTH)
            temp[i]+=temp[j];
        __syncthreads();
    }

    if(tx==0)
        blockSums[bx]=temp[TILE_WIDTH-1];
    // Write the up-swept tile back; scan2 reloads it.
    if(idx<n)
        a[idx]=temp[tx<<1];
    if((idx+1)<n)
        a[idx+1]=temp[(tx<<1)+1];
}

// Pass 2: exclusive scan of blockSums, then per-block down-sweep
// and addition of the block offset.
__global__ void scan2(double *a,int n,double *blockSums,int no_of_blocks)
{
    __shared__ double temp[TILE_WIDTH];

    int tx=threadIdx.x,bx=blockIdx.x,k;
    int idx=__umul24(bx,TILE_WIDTH)+(tx<<1),offset;

    // Reload the up-swept tile written by scan().
    temp[tx<<1]=(idx<n)?a[idx]:0;
    temp[(tx<<1)+1]=(idx+1<n)?a[idx+1]:0;

    // Thread 0 of block 0 scans blockSums serially.
    // NOTE: __syncthreads() only synchronizes threads within one block,
    // so other blocks may read blockSums[bx] below before this is done.
    if(tx==0 && bx==0)
    {
        // Up-sweep over the block sums.
        for(offset=2;offset<=no_of_blocks;offset<<=1)
            for(k=0;k<(no_of_blocks>>1);k++)
            {
                int i=__umul24(offset,(k+1))-1;
                int j=i-(offset>>1);
                if(i<no_of_blocks)
                    blockSums[i]+=blockSums[j];
            }

        blockSums[no_of_blocks-1]=0;

        // Down-sweep over the block sums.
        for(offset=no_of_blocks;offset>=2;offset>>=1)
            for(k=0;k<(no_of_blocks>>1);k++)
            {
                int i=__umul24(offset,(k+1))-1;
                int j=i-(offset>>1);
                if(i<no_of_blocks)
                {
                    double t=blockSums[i];
                    blockSums[i]+=blockSums[j];
                    blockSums[j]=t;
                }
            }
    }

    __syncthreads();
    temp[TILE_WIDTH-1]=0;   // every thread writes the same value

    // Down-sweep within the tile.
    for(offset=TILE_WIDTH;offset>=2;offset>>=1)
    {
        int i=__umul24(offset,(tx+1))-1;
        int j=i-(offset>>1);
        __syncthreads();
        if(i<TILE_WIDTH)
        {
            double t=temp[i];
            temp[i]+=temp[j];
            temp[j]=t;
        }
    }
    __syncthreads();

    // Add the scanned sum of all preceding blocks.
    temp[tx<<1]+=blockSums[bx];
    temp[(tx<<1)+1]+=blockSums[bx];
    __syncthreads();

    if(idx<n)
        a[idx]=temp[tx<<1];
    if((idx+1)<n)
        a[idx+1]=temp[(tx<<1)+1];
}[/codebox]

(double = float here, since double gets demoted to float on this card.)

The code may not be perfect as far as performance is concerned.

But it does the correct scan.

I think the discussion is surely complete, but see if you can add something more.

By the way nice fight with others :rolleyes: