Wrong output in double precision

loloasb · May 17, 2011, 12:57pm

Hi,

I’ve a problem when I use double precision.
This is my code:

#define N 512
#define BLOCK_SIZE 16
//#define BLOCK_ZIE 8

global void affectMatrixKernel(double* A, double* B, N)
{
int i=blockIdx.yblockDim.y+threadIdx.y;
int j=blockIdx.xblockDim.x+threadIdx.x;

B[i*N+j]=A[i*N+j];

}

int main()
{
int size = NNsizeof(double);

//host memory allocation
double* A_h=(double*)malloc(size);
double* B_h=(double*)malloc(size);

//Initialisation of A_h

//device memory allocation
double* A_d;
cudaMalloc(&A_d,size);
double* B_d;
cudaMalloc(&B_d,size);

//Copy into device memory
cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);
cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);

//kernel execution
dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);
dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );

affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);

//Copy into the host
cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);

//Free device memory
cudaFree(A_d);
cudaFree(B_d);

//Free host memory
    free(A_h);
    free(B_h);

}

When I compare A_h and B_h, I have one or two errors in some lines. For example, I have a 2.00000 for B_h whereas I have 5.00000 in A_h ; at corresponding position.
This code works when i use float.
When I use double it works with BLOCK_SIZE=8. If BLOCK_SIZE=16, I have to replace this line:
B[i*N+j]=A[i*N+j]; by B[jN+i]=A[jN+i];
But this isn’t efficient because of the coalesced memory access.

I compile with nvcc: nvcc code.cu -arch=sm_13 -o code
I work on linux with fedora 13.

Thanks in advance for all your answers.

tera · May 17, 2011, 1:11pm

Can you show us the actual program you are running? This one would not compile because [font=“Courier New”]M[/font] is undefined.

While it might not be the problem in this case, the line

dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );

is highly fragile because any small rounding error might cause the launch of additional and out-of-bounds blocks. This also highlights that you should check return values of every CUDA call, as the out-of-bounds access and following abortion of the kernel would go completely unnoticed.

LSChien · May 17, 2011, 1:27pm

your code works on my machine with GTX480.

As tera said, you should use

dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE )

also check boundary condition in the kernel

__global__ void affectMatrixKernel(double* A, double* B)

{

    int i=blockIdx.y*blockDim.y+threadIdx.y;

    int j=blockIdx.x*blockDim.x+threadIdx.x;

if ( (i < N) && (j < N) ){

        B[i*N+j]=A[i*N+j];

    }

}

loloasb · May 17, 2011, 1:40pm

Can you show us the actual program you are running? This one would not compile because [font=“Courier New”]M[/font] is undefined.

While it might not be the problem in this case, the line
dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );
is highly fragile because any small rounding error might cause the launch of additional and out-of-bounds blocks. This also highlights that you should check return values of every CUDA call, as the out-of-bounds access and following abortion of the kernel would go completely unnoticed.

Thanks for your answer.

I’ve made a mistake in my post.

My complete code is :

include <stdio.h>

include <stdlib.h>

include <cuda.h>

include <cuda_runtime.h>

define N 512

define BLOCK_SIZE 16

//define BLOCK_SIZE 8

global void affectMatrixKernel(double* A, double* B, int Nb)

{

int i=blockIdx.y*blockDim.y+threadIdx.y;

int j=blockIdx.x*blockDim.x+threadIdx.x;

//if(i<N && j<N)

	B[i*Nb+j]=A[i*Nb+j];

}

int main()

{

int i,j;

int size = N*N*sizeof(double);



//Host memory allocation

double* A_h=(double*)malloc(size);

double* B_h=(double*)malloc(size);



//Initialisation de A_h

for (i=0;i<N;i++)

{

	for (j=0;j<N;j++)

	{

		A_h[i*N+j]=rand() % 5;

	}

}



//Device memory allocation

double* A_d;

cudaMalloc(&A_d,size);

double* B_d;

cudaMalloc(&B_d,size);



//Copy from host to devicde

cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);

cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);

//Kernel execution

dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);

dim3 nbBloc ( ceil(N/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );



affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);



//Copie from device to host

cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);

//Free device memory

cudaFree(A_d);

cudaFree(B_d);

//Write A_h and B_h in .txt files

FILE* result_GPU;

FILE* A;

A=fopen("A.txt","w");

result_GPU=fopen("A_copy.txt","w");

if (A)

{

	for (i=0;i<N;i++)

	{

		for (j=0;j<N;j++)

		{

			fprintf(A,"%f\t",A_h[i*N+j]);

		}

	fputc('\n',A);

	}

}

if (result_GPU)

{

	for (i=0;i<N;i++)

	{

		for (j=0;j<N;j++)

		{

			fprintf(result_GPU,"%f\t",B_h[i*N+j]);

		}

	fputc('\n',result_GPU);

	}

}

fclose(A);

fclose(result_GPU);



//LibÃ©ration host memory

free(A_h);

free(B_h);

}

It works with float. With double I have to put BLOCK_SIZE=8.

I work on this problem since one week …

Thanks.

loloasb · May 17, 2011, 1:43pm

It doesn’t work with :
dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE ) …

loloasb · May 17, 2011, 1:45pm

I compare with the diff command :
diff A.txt A_copy.txt

With : dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE )

It works but by puting BLOCK_SIZE=8, not for BLOCK_SIZE=16.

loloasb · May 17, 2011, 1:56pm

Maybe because I work on 1.3 compute Capability (GTX 285), and you with 2.0 ?
It seems to be too easy …

LSChien · May 17, 2011, 2:39pm

CC1.3 works for double-precision.

Do you use CUDA 4.0?

loloasb · May 17, 2011, 2:50pm

I don’t know, how could I know it?

LSChien · May 17, 2011, 5:02pm

check version of nvcc

nvcc -V

loloasb · May 18, 2011, 6:41am

It gives me :

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

loloasb · May 18, 2011, 8:44am

Sometimes the code works, with N=512.

But when N=1024, it doesn’t …

LSChien · May 18, 2011, 1:45pm

I try another machine which has C1060 and CUDA 3.2.
Your code works.
So it should not be problem of compiler.

Do you try another card?

loloasb · May 19, 2011, 7:07am

Thanks for your answers.

Np, I don’t try it.

At first view, I thought it works. But when I compare by diff command, there are errors in some lines. Just for one or two position by line.

Maybe because of the card like you said.

loloasb · May 19, 2011, 8:02am

I haven’t another card …

What is your system config?

For gcc I have :

gcc (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

My linux is :

2.6.33.3-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Maybe it’s a bug with my card, how could I report it ?

Thanks.

LSChien · May 19, 2011, 8:14am

gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

Copyright (C) 2008 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I don’t think that this is a hardware issue because you just move 64-bit data.

Could you try float2 (replace “double” by “float2”)?

Also you can install CUDA 4.0 RC2.

loloasb · May 19, 2011, 8:37am

gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

Copyright (C) 2008 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I don’t think that this is a hardware issue because you just move 64-bit data.

Could you try float2 (replace “double” by “float2”)?

Also you can install CUDA 4.0 RC2.

It doesn’t compile with float2 because of the random value:

//Initialisation of A_h

for (i=0;i<N;i++)

{

	for (j=0;j<N;j++)

	{

		A_h[i*N+j]=rand() % 5;

	}

}

I try to install CUDA 4.0.

LSChien · May 19, 2011, 9:48am

try attached file.

loloasb · May 19, 2011, 11:15am

Your code works for N=512 and BLOCK_SIZE=16. I have max err=.=0.0000 .

But for N=1024 and BLOCK_SIZE=16 it doesn’t, I have max err=32.0000. So in this case, A.txt and A_copy.txt are differents.

Have you tried for N=1024?

Thanks.

LSChien · May 19, 2011, 1:11pm

N=1024 works. I test it on TeslaC1060 with CUDA 3.2 and GTX480 with CUDA 4.0.

Do you have other cards, low-end, for example, GT8xxx, GT9xxx,…?

If you have, then you can test this program, you don’t need to compile with sm13 because float2 is not “double”.

If the program works on other cards, then your GTX285 may have some problem.

Topic		Replies	Views
Issues with double precision support on GT200 CUDA Programming and Performance	7	2788	July 7, 2008
Issues with double precision support on GT200 CUDA Programming and Performance	0	3701	July 4, 2008
problem with double precision unpredictable results Different run give differents errors or no error CUDA Programming and Performance	12	2897	September 10, 2010
worked fine for "int" "float" but NOT "double" CUDA Programming and Performance	13	5080	March 9, 2009
Problem with double precision matrix float works, double doesn't CUDA Programming and Performance	7	3977	April 22, 2009
Kernel works in single precision but not in double CUDA Programming and Performance	7	1664	July 28, 2009
Problem using double precision arithmetic on GT200 Incorrect results using double precision CUDA Programming and Performance	0	4288	May 30, 2008
Double Precision Problem Legacy PGI Compilers	8	5030	November 12, 2010
No results are written to output buffer - why? CUDA Programming and Performance	34	19805	July 15, 2009
Double precision in CUDA 2.3 CUDA Programming and Performance	5	38223	March 5, 2010

Wrong output in double precision

Related topics