Wrong output in double precision

Hi,

I’ve a problem when I use double precision.
This is my code:

#define N 512
#define BLOCK_SIZE 16
//#define BLOCK_ZIE 8

global void affectMatrixKernel(double* A, double* B, N)
{
int i=blockIdx.yblockDim.y+threadIdx.y;
int j=blockIdx.x
blockDim.x+threadIdx.x;

B[i*N+j]=A[i*N+j];

}

int main()
{
int size = NNsizeof(double);

//host memory allocation
double* A_h=(double*)malloc(size);
double* B_h=(double*)malloc(size);

//Initialisation of A_h

//device memory allocation
double* A_d;
cudaMalloc(&A_d,size);
double* B_d;
cudaMalloc(&B_d,size);

//Copy into device memory
cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);
cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);

//kernel execution
dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);
dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );

affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);

//Copy into the host
cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);

//Free device memory
cudaFree(A_d);
cudaFree(B_d);

//Free host memory
    free(A_h);
    free(B_h);

}

When I compare A_h and B_h, I have one or two errors in some lines. For example, I have a 2.00000 for B_h whereas I have 5.00000 in A_h ; at corresponding position.
This code works when i use float.
When I use double it works with BLOCK_SIZE=8. If BLOCK_SIZE=16, I have to replace this line:
B[i*N+j]=A[i*N+j]; by B[jN+i]=A[jN+i];
But this isn’t efficient because of the coalesced memory access.

I compile with nvcc: nvcc code.cu -arch=sm_13 -o code
I work on linux with fedora 13.

Thanks in advance for all your answers.

Can you show us the actual program you are running? This one would not compile because [font=“Courier New”]M[/font] is undefined.

While it might not be the problem in this case, the line

dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );

is highly fragile because any small rounding error might cause the launch of additional and out-of-bounds blocks. This also highlights that you should check return values of every CUDA call, as the out-of-bounds access and following abortion of the kernel would go completely unnoticed.

your code works on my machine with GTX480.

As tera said, you should use

dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE )

also check boundary condition in the kernel

__global__ void affectMatrixKernel(double* A, double* B)

{

    int i=blockIdx.y*blockDim.y+threadIdx.y;

    int j=blockIdx.x*blockDim.x+threadIdx.x;

if ( (i < N) && (j < N) ){

        B[i*N+j]=A[i*N+j];

    }

}

Thanks for your answer.

I’ve made a mistake in my post.

My complete code is :

include <stdio.h>

include <stdlib.h>

include <cuda.h>

include <cuda_runtime.h>

define N 512

define BLOCK_SIZE 16

//define BLOCK_SIZE 8

global void affectMatrixKernel(double* A, double* B, int Nb)

{

int i=blockIdx.y*blockDim.y+threadIdx.y;

int j=blockIdx.x*blockDim.x+threadIdx.x;

//if(i<N && j<N)

	B[i*Nb+j]=A[i*Nb+j];

}

int main()

{

int i,j;

int size = N*N*sizeof(double);



//Host memory allocation

double* A_h=(double*)malloc(size);

double* B_h=(double*)malloc(size);



//Initialisation de A_h

for (i=0;i<N;i++)

{

	for (j=0;j<N;j++)

	{

		A_h[i*N+j]=rand() % 5;

	}

}



//Device memory allocation

double* A_d;

cudaMalloc(&A_d,size);

double* B_d;

cudaMalloc(&B_d,size);



//Copy from host to devicde

cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);

cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);

//Kernel execution

dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);

dim3 nbBloc ( ceil(N/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );



affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);



//Copie from device to host

cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);

//Free device memory

cudaFree(A_d);

cudaFree(B_d);

//Write A_h and B_h in .txt files

FILE* result_GPU;

FILE* A;

A=fopen("A.txt","w");

result_GPU=fopen("A_copy.txt","w");

if (A)

{

	for (i=0;i<N;i++)

	{

		for (j=0;j<N;j++)

		{

			fprintf(A,"%f\t",A_h[i*N+j]);

		}

	fputc('\n',A);

	}

}

if (result_GPU)

{

	for (i=0;i<N;i++)

	{

		for (j=0;j<N;j++)

		{

			fprintf(result_GPU,"%f\t",B_h[i*N+j]);

		}

	fputc('\n',result_GPU);

	}

}

fclose(A);

fclose(result_GPU);



//Libération host memory

free(A_h);

free(B_h);

}

It works with float. With double I have to put BLOCK_SIZE=8.

I work on this problem since one week …

Thanks.

It doesn’t work with :
dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE ) …

I compare with the diff command :
diff A.txt A_copy.txt

With : dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE )

It works but by puting BLOCK_SIZE=8, not for BLOCK_SIZE=16.

Maybe because I work on 1.3 compute Capability (GTX 285), and you with 2.0 ?
It seems to be too easy …

CC1.3 works for double-precision.

Do you use CUDA 4.0?

I don’t know, how could I know it?

check version of nvcc

nvcc -V

It gives me :

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

Sometimes the code works, with N=512.

But when N=1024, it doesn’t …

I try another machine which has C1060 and CUDA 3.2.
Your code works.
So it should not be problem of compiler.

Do you try another card?

Thanks for your answers.

Np, I don’t try it.

At first view, I thought it works. But when I compare by diff command, there are errors in some lines. Just for one or two position by line.

Maybe because of the card like you said.

I haven’t another card …

What is your system config?

For gcc I have :

gcc (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)

Copyright (C) 2010 Free Software Foundation, Inc.

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

My linux is :

2.6.33.3-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Maybe it’s a bug with my card, how could I report it ?

Thanks.

gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

Copyright (C) 2008 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I don’t think that this is a hardware issue because you just move 64-bit data.

Could you try float2 (replace “double” by “float2”)?

Also you can install CUDA 4.0 RC2.

It doesn’t compile with float2 because of the random value:

//Initialisation of A_h

for (i=0;i<N;i++)

{

	for (j=0;j<N;j++)

	{

		A_h[i*N+j]=rand() % 5;

	}

}

I try to install CUDA 4.0.

try attached file.

Your code works for N=512 and BLOCK_SIZE=16. I have max err=.=0.0000 .

But for N=1024 and BLOCK_SIZE=16 it doesn’t, I have max err=32.0000. So in this case, A.txt and A_copy.txt are differents.

Have you tried for N=1024?

Thanks.

N=1024 works. I test it on TeslaC1060 with CUDA 3.2 and GTX480 with CUDA 4.0.

Do you have other cards, low-end, for example, GT8xxx, GT9xxx,…?

If you have, then you can test this program, you don’t need to compile with sm13 because float2 is not “double”.

If the program works on other cards, then your GTX285 may have some problem.