Problem with double precision matrix float works, double doesn't

[font=“Courier New”][font=“Courier New”]Hello. So far I haven’t received any answer for my questions here, I hope to have better luck this time.

I am trying to learn matrix operations in CUDA (with Tesla C1060). I have an NN symmetric matrix organized in coumn major. I want to increment the upper half elements by a constant. My simple code has a total of NN threads, so only N(N+1)/2 threads will be active.

I run the kernel and check the matrix after 1000 iterations. With the “CorrectKernel” the results are as expected whether the TYPE is double or float. With “WrongKernel”, if the TYPE is float the results are as expected (the upper half is updated to 2000 and the rest of the matrix is 0), but if the TYPE is double the upper half is updated correctly but the rest of the matrix has random values. It seems like with “double” either the memory gets messed up (alignment problems?) or the idle threads don’t stay idle! What am I doing wrong? I appreciate any help.

[font=“Courier New”][font=“Courier New”]

[codebox]#define TYPE double

#define N 256

TYPE *matrix1, *matrix2;

TYPE alpha = 2;

[/font][/font][font=“Courier”][font=“Courier”]CUDA_SAFE_CALL( cudaMalloc( (void**) &matrix1, NNsizeof(TYPE)));

CUDA_SAFE_CALL( cudaMalloc( (void**) &matrix2, NNsizeof(TYPE)));

for(int i=0; i<1000; i++)

CorrectKernel <<<N,N>>>(matrix1, alpha );

for(int i=0; i<1000; i++)

WrongKernel <<<N,N>>>(matrix2, alpha );

global void CorrectKernel(TYPE *base, const TYPE alpha)

{

int bx= blockIdx.x* N + threadIdx.x;

if(threadIdx.x <= blockIdx.x){

    base[ bx] += alpha;

else

    base[bx] = 0; //force the idle threads to do something

__syncthreads();

}

global void WrongKernel(TYPE *base, const TYPE alpha)

{

int bx= blockIdx.x* N + threadIdx.x;

if(threadIdx.x <= blockIdx.x){

   base[ bx] += alpha;     //only active threads increment, idle threads stay idle

__syncthreads();

}[/codebox]

[/font][/font]

[font=“Courier New”][font=“Courier New”]

[/font][/font][/font][/font]

Compile with -arch sm_13. Meanwhile, I will punch the compiler team for allowing people to use doubles without returning an error when you’re not compiling with sm_13…

Heheheh… Isn’t the automatic demotion of double precision to single precision a documented “feature” of CUDA? That was the line all the way back to CUDA 0.8, though I’m not sure if anyone actually relies on this behavior. I’m not even sure if this automatic-demotion works in practice since it seems like anyone who forgets -arch sm_13 has their kernel explode all over rather than quietly lose precision.

Making double usage without sm_13 a compiler errors gets a +1 vote from me if such a change is possible now.

Thanks for your response. I do complie with “-arch sm_13 -code sm_13” (though I’m not sure if I should have -code). If the code doesn’t have an obvious problem, I suspect that I am missing some initialization. What is the proper way of initializing the GPU in each run?

It’s not that bad whenever you explicitly define a double within a kernel on an architecture that doesn’t support it (constants, basically). It’s wrong whenever you try to pass a double from the host to the device without compiling with arch sm_13. I think the most useful behavior would be a warning in the former (somebody hit that earlier today, for example) and an error for the latter.

mahnaz: Are you checking errors both on the launch (cudaGetLastError immediately after trying to launch the DP kernel) and the kernel execution (check the return status of cudaThreadSynchronize)?

1. Is my modification OK?

(CUDA_BIN_PATH)\nvcc.exe" <b>-arch sm_13</b> -ccbin "(VCInstallDir)bin” -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"(CUDA_INC_PATH)" -I./ -I../../common/inc -o (ConfigurationName)\simpleCUFFT.obj simpleCUFFT.cu

2. Is this compiler switch better for my 285GTX

-gencode arch=compute_13,code=sm_13

[font=“Verdana”]No, I haven’t been checking errors. I will. The problem is that most of the times the code runs fine. It is an intermittent problem and I can’t reproduce it. When I reboot the system things are correct, until something goes wrong and after that it doesn’t recover and the errors show up with each run. That’s why I think I must be overstepping on something and need to do a better initializing of the GPU. I know I don’t have memory leak, and am very careful about freeing allocated memory. Also the system never crashes, only the content of memory gets messed up. It always work with emurelase and I can’t see any race condition in that simple kernel.

Thanks for your help.

[/font]

Sorry for the double post. I don’t know what I am doing wrong in quoting the code that it doesn’t come out right (the preview window is different from what shows up when I post). I just noticed that in my original post part of the code is cut off. I will add the Correct and Incorrect kernels here, and appreciate some input as to why when idle threads update the memory the result is correct but when they don’t do anything it gets randome values. Thanks again.

[b]
global void CorrectKernel(TYPE *base, const TYPE alpha)
{

int bx= blockIdx.x* N + threadIdx.x;

if(threadIdx.x <= blockIdx.x)
base[ bx] += alpha;
else
base[bx] = 0; //force the idle threads to do something

__syncthreads();
}

global void WrongKernel(TYPE *base, const TYPE alpha)
{

int bx= blockIdx.x* N + threadIdx.x;

if(threadIdx.x <= blockIdx.x)
base[ bx] += alpha; //just the active threads update the array

__syncthreads();
}
[/b]