Hi,
I have a problem when I use double precision.
This is my code:
#define N 512
#define BLOCK_SIZE 16
//#define BLOCK_SIZE 8
__global__ void affectMatrixKernel(double* A, double* B, N)
{
int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;
B[i*N+j]=A[i*N+j];
}
int main()
{
int size = N*N*sizeof(double);
//host memory allocation
double* A_h=(double*)malloc(size);
double* B_h=(double*)malloc(size);
//Initialisation of A_h
//device memory allocation
double* A_d;
cudaMalloc(&A_d,size);
double* B_d;
cudaMalloc(&B_d,size);
//Copy into device memory
cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);
cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);
//kernel execution
dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);
dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );
affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);
//Copy into the host
cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);
//Free device memory
cudaFree(A_d);
cudaFree(B_d);
//Free host memory
free(A_h);
free(B_h);
}
When I compare A_h and B_h, there are one or two errors in some lines. For example, B_h has 2.00000 where A_h has 5.00000 at the corresponding position.
This code works when I use float.
When I use double, it works with BLOCK_SIZE=8. With BLOCK_SIZE=16, I have to replace this line:
B[i*N+j]=A[i*N+j]; by B[j*N+i]=A[j*N+i];
But this isn’t efficient because it breaks coalesced memory access.
I compile with nvcc: nvcc code.cu -arch=sm_13 -o code
I work on Linux (Fedora 13).
Thanks in advance for all your answers.
tera
May 17, 2011, 1:11pm
2
Can you show us the actual program you are running? This one would not compile because M is undefined.
While it might not be the problem in this case, the line
dim3 nbBloc ( ceil(M/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );
is highly fragile, because any small rounding error might cause the launch of additional, out-of-bounds blocks. This also highlights that you should check the return value of every CUDA call; otherwise the out-of-bounds access and the resulting abort of the kernel would go completely unnoticed.
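A minimal sketch of such per-call checking (not from the thread; it assumes the CUDA 3.2-era runtime API, hence cudaThreadSynchronize rather than the later cudaDeviceSynchronize):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap every runtime call so a failure reports file and line, then exits.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc(&A_d, size));
//   affectMatrixKernel<<<nbBloc, nbThreadsPerBloc>>>(A_d, B_d, N);
//   CUDA_CHECK(cudaGetLastError());         // catches launch errors
//   CUDA_CHECK(cudaThreadSynchronize());    // catches errors during execution
```

Kernel launches themselves return nothing, so the cudaGetLastError / synchronize pair after the launch is what makes an aborted kernel visible.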
Your code works on my machine with a GTX 480.
As tera said, you should use
dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE );
and also check the boundary condition in the kernel:
__global__ void affectMatrixKernel(double* A, double* B)
{
int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;
if ( (i < N) && (j < N) ){
B[i*N+j]=A[i*N+j];
}
}
Thanks for your answer.
I made a mistake in my post.
My complete code is:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 512
#define BLOCK_SIZE 16
//#define BLOCK_SIZE 8
__global__ void affectMatrixKernel(double* A, double* B, int Nb)
{
int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;
//if(i<N && j<N)
B[i*Nb+j]=A[i*Nb+j];
}
int main()
{
int i,j;
int size = N*N*sizeof(double);
//Host memory allocation
double* A_h=(double*)malloc(size);
double* B_h=(double*)malloc(size);
//Initialisation of A_h
for (i=0;i<N;i++)
{
for (j=0;j<N;j++)
{
A_h[i*N+j]=rand() % 5;
}
}
//Device memory allocation
double* A_d;
cudaMalloc(&A_d,size);
double* B_d;
cudaMalloc(&B_d,size);
//Copy from host to device
cudaMemcpy(A_d,A_h,size,cudaMemcpyHostToDevice);
cudaMemcpy(B_d,B_h,size,cudaMemcpyHostToDevice);
//Kernel execution
dim3 nbThreadsPerBloc (BLOCK_SIZE,BLOCK_SIZE);
dim3 nbBloc ( ceil(N/(double)nbThreadsPerBloc.x) , ceil(N/(double)nbThreadsPerBloc.y) );
affectMatrixKernel<<<nbBloc,nbThreadsPerBloc>>>(A_d,B_d,N);
//Copy from device to host
cudaMemcpy(B_h,B_d,size,cudaMemcpyDeviceToHost);
//Free device memory
cudaFree(A_d);
cudaFree(B_d);
//Write A_h and B_h in .txt files
FILE* result_GPU;
FILE* A;
A=fopen("A.txt","w");
result_GPU=fopen("A_copy.txt","w");
if (A)
{
for (i=0;i<N;i++)
{
for (j=0;j<N;j++)
{
fprintf(A,"%f\t",A_h[i*N+j]);
}
fputc('\n',A);
}
}
if (result_GPU)
{
for (i=0;i<N;i++)
{
for (j=0;j<N;j++)
{
fprintf(result_GPU,"%f\t",B_h[i*N+j]);
}
fputc('\n',result_GPU);
}
}
fclose(A);
fclose(result_GPU);
//Free host memory
free(A_h);
free(B_h);
}
It works with float. With double, I have to set BLOCK_SIZE=8.
I have been working on this problem for a week …
Thanks.
It doesn’t work with:
dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE ) …
I compare with the diff command:
diff A.txt A_copy.txt
With dim3 nbBloc ( (N+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE ),
it works with BLOCK_SIZE=8, but not with BLOCK_SIZE=16.
Maybe because I am on compute capability 1.3 (GTX 285) and you are on 2.0?
It seems too easy …
CC 1.3 supports double precision.
Do you use CUDA 4.0?
I don’t know; how can I check?
Check the nvcc version:
nvcc -V
It gives me:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
Sometimes the code works with N=512.
But with N=1024 it doesn’t …
I tried another machine, which has a C1060 and CUDA 3.2.
Your code works there.
So it should not be a problem with the compiler.
Have you tried another card?
Thanks for your answers.
No, I haven’t tried that.
At first sight I thought it worked, but when I compare with the diff command there are errors in some lines, just one or two positions per line.
Maybe it’s the card, like you said.
I don’t have another card …
What is your system config?
For gcc I have:
gcc (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
My Linux is:
2.6.33.3-85.fc13.x86_64 #1 SMP Thu May 6 18:09:49 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
Maybe it’s a bug in my card; how could I report it?
Thanks.
gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I don’t think this is a hardware issue, because you are just moving 64-bit data.
Could you try float2 (replace “double” with “float2”)?
Also you can install CUDA 4.0 RC2.
It doesn’t compile with float2 because of the random-value initialisation:
//Initialisation of A_h
for (i=0;i<N;i++)
{
for (j=0;j<N;j++)
{
A_h[i*N+j]=rand() % 5;
}
}
I will try installing CUDA 4.0.
Your code works for N=512 and BLOCK_SIZE=16; I get max err = 0.0000.
But for N=1024 and BLOCK_SIZE=16 it doesn’t: I get max err = 32.0000, so A.txt and A_copy.txt are different.
Have you tried for N=1024?
Thanks.
N=1024 works. I tested it on a Tesla C1060 with CUDA 3.2 and on a GTX 480 with CUDA 4.0.
So it should not be a compiler problem.
Do you have other, low-end cards, for example a GT 8xxx or GT 9xxx?
If you do, you can test this program; you don’t need to compile with sm_13 because float2 is not double.
If the program works on other cards, then your GTX 285 may have a problem.