Hello,
I installed CUDA 4.0 without problems on Debian 6.0 using “cudatoolkit_4.0.13_linux_32_ubuntu10.10.run” and “devdriver_4.0_linux_32_270.40.run”. I can compile all the examples in “gpucomputingsdk_4.0.13_linux.run” and they run correctly.
However, when I compile a program that I wrote to simulate propagation in the Cubic-Quintic Complex Ginzburg-Landau Equation, the binary built with CUDA 4.0 is slower than the same code built with CUDA 3.1 (80 s with CUDA 4.0 against 50 s with CUDA 3.1).
The Makefile content is:
# Add source files here
EXECUTABLE := cgle2dcuda_sm20
# Cuda source files (compiled with cudacc)
CUFILES_sm_20 := cgle2dcuda_sm20.cu
# Additional compiler flags and LIBs to include
USECUFFT := 1
################################################################################
# Rules and targets
include …/…/common/common.mk
I use cufftExecZ2Z to transform from the time domain to the Fourier domain and back, and two kernels to perform the propagation calculations. An example of the code is:
cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_FORWARD);
cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_FORWARD);
propagateData<<<NBLOCK,NHILOS>>>(d_datau,d_datav, N_SIZE, d_D11u,d_D11v,h_c[3]);
cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_INVERSE);
cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_INVERSE);
and the propagateData function is:
static __global__ void propagateData(cufftDoubleComplex* a, cufftDoubleComplex* b, int size,
                                     cuDoubleComplex* D11a, cuDoubleComplex* D11b, double d_c)
{
    // Normalization factor applied after the forward transform
    double norm = 1. / (double)size;
    // Propagation step size
    double paso = 0.0001;
    // Grid-stride loop: each thread handles elements threadID, threadID + numThreads, ...
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads) {
        a[i] = propagateLineal(a[i], D11a[i], paso, d_c, norm);
        b[i] = propagateLineal(b[i], D11b[i], paso, d_c, norm);
    }
}
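One way to narrow down which stage accounts for the difference is to time the FFT calls and the kernel separately with CUDA events, and compare the numbers between the two toolkits. The following is only a minimal, self-contained sketch under stated assumptions, not the actual simulation: the 1D plan, the value of N_SIZE, the launch configuration, and the stand-in kernel scaleData (which only rescales the data instead of calling propagateLineal) are all hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define N_SIZE 4096  // hypothetical problem size
#define NBLOCK 64    // hypothetical number of blocks
#define NHILOS 256   // hypothetical threads per block

// Hypothetical stand-in for propagateData: only rescales by 1/size.
static __global__ void scaleData(cufftDoubleComplex* a, int size)
{
    const double norm = 1.0 / (double)size;
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads) {
        a[i].x *= norm;
        a[i].y *= norm;
    }
}

int main()
{
    cufftDoubleComplex* d_data = 0;
    if (cudaMalloc((void**)&d_data, sizeof(cufftDoubleComplex) * N_SIZE) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, sizeof(cufftDoubleComplex) * N_SIZE);

    // Assumption: a single 1D double-complex transform; the real plan may differ.
    cufftHandle plan;
    if (cufftPlan1d(&plan, N_SIZE, CUFFT_Z2Z, 1) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }

    // Warm-up pass so that one-time initialization is not counted in the timings.
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    scaleData<<<NBLOCK, NHILOS>>>(d_data, N_SIZE);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    cudaDeviceSynchronize();  // on CUDA 3.x use cudaThreadSynchronize() instead

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    // Record an event on the default stream between the stages.
    cudaEventRecord(t0, 0);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(t1, 0);
    scaleData<<<NBLOCK, NHILOS>>>(d_data, N_SIZE);
    cudaEventRecord(t2, 0);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);  // wait until all three stages have finished

    float msForward = 0.f, msKernel = 0.f, msInverse = 0.f;
    cudaEventElapsedTime(&msForward, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    cudaEventElapsedTime(&msInverse, t2, t3);
    printf("forward FFT: %f ms\nkernel:      %f ms\ninverse FFT: %f ms\n",
           msForward, msKernel, msInverse);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Built with each toolkit in turn (for example with something like nvcc -arch=sm_20 timing.cu -lcufft, where the file name is made up), the per-stage times should indicate whether the slowdown comes from the CUFFT calls or from the kernels.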
Does anybody know why code compiled with CUDA 4.0 runs slower than the same code compiled with CUDA 3.1, or what I am doing wrong? I am using a GeForce GTX 470 card.
Thanks.