Random output when running code on a Fermi card

Hello,

I’m doing research on medical imaging using CUDA. I’ve been using a GTX 295 (compiled with compute_13,sm_13) and my code runs great on it. My university just received a Tesla C2070, so I wanted to see how much faster my code would run on the new card. When I compile the code with “compute_20,sm_20” for the Tesla C2070, I get random output: each time I run the program, I get different results.

I’ve included output from compiling one of the .cu files from Visual Studio 2010.

Any help will be much appreciated!

1>CudaBuild:
1> Compiling CUDA source file samProRKAKernel.cu…
1>
1> C:\Users\mac1190\documents\visual studio 2010\Projects\Heart_x64\Heart_x64>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" -gencode=arch=compute_20,code="sm_20,compute_20" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include" -G0 --keep-dir "x64\Debug" -maxrregcount=0 --ptxas-options=-v --machine 64 --compile -ftz=false -prec-div=true -prec-sqrt=true -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /O2 /Zi /MDd " -o "x64\Debug\samProRKAKernel.cu.obj" "C:\Users\mac1190\documents\visual studio 2010\Projects\Heart_x64\Heart_x64\samProRKAKernel.cu"
1> samProRKAKernel.cu
1> tmpxft_000012e4_00000000-0_samProRKAKernel.cudafe1.gpu
1> tmpxft_000012e4_00000000-5_samProRKAKernel.cudafe2.gpu
1> samProRKAKernel.cu
1> ptxas info : Compiling entry function '_Z8cudaUVFgPdS_S_S_S_ii' for 'sm_20'
1> ptxas info : Function properties for _Z8cudaUVFgPdS_S_S_S_ii
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 16 registers, 80 bytes cmem[0]
1> tmpxft_000012e4_00000000-0_samProRKAKernel.cudafe1.cpp
1> tmpxft_000012e4_00000000-11_samProRKAKernel.ii
1> Deleting file "tmpxft_000012e4_00000000-6_samProRKAKernel.cpp3.o".

Try cuda-memcheck. You probably have an out-of-bounds access in shared memory (on Fermi it will hit different memory locations than on your old card).

Also check the computation time and the error code returned by the kernel launch. Keep in mind that the Tesla has a different number of SMs. Can you debug your program to see what is going on?
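
For the error code, something like this after each launch catches both launch errors and errors raised during execution (just a sketch with a dummy kernel, not your actual code):

#include <cstdio>
#include <cuda_runtime.h>

// dummy kernel, only here to show the checking pattern
__global__ void myKernel(double *out)
{
    out[threadIdx.x] = (double)threadIdx.x;
}

int main()
{
    double *d_out;
    cudaMalloc((void**)&d_out, 32 * sizeof(double));

    myKernel<<<1, 32>>>(d_out);

    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();             // errors raised while the kernel runs
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}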

I ran cuda-memcheck and it reported no errors.

========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

With no changes to the code for the new architecture, the program runs about 4 times faster than on the GTX 295. Debugging isn’t easy because the code works with very large matrices. I compared one matrix computed on the CPU with the same matrix computed on the Tesla, and every element was incorrect.

I ran my reduction and transpose kernels in a different project and the CPU and Tesla agreed.

Btw, you can compile the code for compute capability 1.2 and check whether it runs correctly on the Fermi card; you do not need to build for 2.0 to run on Fermi, and 1.2 builds are often faster on Fermi than 2.0 builds. Also try CUDA 4.1, which has a new compiler for Fermi; maybe it was a compiler bug.
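
If I’m not mistaken, an sm_1x binary will not load on an sm_2x card, so for the 1.2 test you would embed the PTX and let the driver JIT it for the Fermi card at run time, e.g. by replacing the -gencode option with:

-gencode=arch=compute_12,code=compute_12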

I found and fixed the error. First, I would like to describe what the kernel that produced the error was doing. Remember, this was only a problem on the Tesla C2070, not the GTX 295.

The kernel was given a matrix. Each row of the matrix was reduced to two numbers, and these two numbers were then reduced further into one number. The output was a column vector in which each element was the reduction of one row of the matrix.

The error was in the following code:
if (tid == 0 && blockIdx.x == 0)
    Y[blockIdx.y] = B[linBlockId] + B[linBlockId + 1]; // Y is the output column vector of the kernel

This may seem inefficient, but it was faster than transferring the matrix to the CPU, performing the final reduction there, and then transferring the resulting vector back to the GPU. Anyway, the values in B (global memory) were correct on the Tesla C2070, so I do not understand why I was getting NaN in Y with the 4.1 SDK and random output with the 4.0 SDK on the Tesla C2070.

I fixed the problem by removing the line of code above and using the data in B directly in the rest of the program. It worked out, since I believe this is faster than the previous implementation.
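
For reference, a minimal sketch of what using B directly can look like, assuming B holds two partial sums per row at indices 2*row and 2*row+1 (the names and layout here are my guesses, not the actual code):

__global__ void consumeRowSums(const double *B, double *out, int nRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows)
    {
        // combine the two partial sums for this row on the fly
        double rowSum = B[2 * row] + B[2 * row + 1];
        out[row] = rowSum; // or feed rowSum into whatever the program needs next
    }
}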

Thanks for the help!