Code compiled in CUDA 4.0 slower than CUDA 3.1

Hello,

I installed CUDA 4.0 without problems on Debian 6.0 using "cudatoolkit_4.0.13_linux_32_ubuntu10.10.run" and "devdriver_4.0_linux_32_270.40.run". I can compile all the examples in "gpucomputingsdk_4.0.13_linux.run" and they run correctly.

However, when I compile a program that I wrote to simulate propagation governed by the cubic-quintic complex Ginzburg-Landau equation, the code compiled with CUDA 4.0 is slower than the same code compiled with CUDA 3.1 (80 s with CUDA 4.0 versus 50 s with CUDA 3.1).

The Makefile content is:

# Add source files here
EXECUTABLE := cgle2dcuda_sm20

# Cuda source files (compiled with cudacc)
CUFILES_sm_20 := cgle2dcuda_sm20.cu

# Additional compiler flags and LIBs to include
USECUFFT := 1

################################################################################
# Rules and targets
include ../../common/common.mk

I use cufftExecZ2Z to transform to and from Fourier space, both forward and backward, and two kernels to perform the propagation calculations. An example of the code we use is:

    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_FORWARD);
    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_FORWARD);

    propagateData<<<NBLOCK,NHILOS>>>(d_datau, d_datav, N_SIZE, d_D11u, d_D11v, h_c[3]);

    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_INVERSE);
    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_INVERSE);

and the propagateData function is:

    static __global__ void propagateData(cufftDoubleComplex* a, cufftDoubleComplex* b, int size,
                                          cuDoubleComplex* D11a, cuDoubleComplex* D11b, double d_c)
    {
        double norm = 1. / (double)size;   // compensate for cuFFT's unnormalized transforms
        double paso = 0.0001;              // step size
        const int numThreads = blockDim.x * gridDim.x;
        const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        // grid-stride loop over all elements of both fields
        for (int i = threadID; i < size; i += numThreads) {
            a[i] = propagateLineal(a[i], D11a[i], paso, d_c, norm);
            b[i] = propagateLineal(b[i], D11b[i], paso, d_c, norm);
        }
    }
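For context, I am not posting propagateLineal itself; a rough sketch of the kind of linear split-step update it performs (assuming an exponential propagator; the exact expression and the role of the h_c[3] coefficient are omitted here) would be:

    #include <cufft.h>      // cufftDoubleComplex
    #include <cuComplex.h>  // cuCreal, cuCimag, cuCmul, make_cuDoubleComplex

    // Sketch only -- not the exact code. Multiply each Fourier coefficient by
    // exp(D * paso) and apply the 1/size normalization that cuFFT leaves to
    // the caller. The d_c coefficient is not used in this simplified version.
    static __device__ cufftDoubleComplex propagateLineal(cufftDoubleComplex u, cuDoubleComplex D,
                                                         double paso, double d_c, double norm)
    {
        (void)d_c;                               // omitted from the sketch
        double growth = exp(cuCreal(D) * paso);  // real part of D: gain/loss
        double phase  = cuCimag(D) * paso;       // imaginary part of D: dispersion
        cuDoubleComplex prop = make_cuDoubleComplex(growth * cos(phase), growth * sin(phase));
        cuDoubleComplex out  = cuCmul(u, prop);  // apply the linear propagator
        return make_cuDoubleComplex(norm * cuCreal(out), norm * cuCimag(out));
    }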

Does anybody know why code compiled with CUDA 4.0 runs slower than code compiled with CUDA 3.1, or what I am doing wrong? I am using a GeForce GTX 470 card.

Thanks.

Hi,

I have the same problem, but under Windows 7. Code compiled under 3.1 runs twice as fast as the same code recompiled under 4.0.

I tried compiling under 4.0 and linking to 3.1 libraries, but that doesn’t seem to help.

Nobody else has observed this?

I tried 3.2 and it's as fast as 3.1, so something happened between 3.2 and 4.0. I hope the guys at NVIDIA will fix it or tell us what we are doing wrong, but for now I'll stay on 3.2.

How many registers do the kernels use?

I’m pretty new to CUDA so I don’t know how to check that. Is there some tool to get that information?

Compile with --ptxas-options=-v added to nvcc's command line.
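For example, something along these lines (the file name and -arch value are just placeholders to adapt to your own project):

    nvcc --ptxas-options=-v -arch=sm_20 -c mykernel.cu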

OK, thanks for the tip. So --ptxas-options=-v gives me the following additional information (under CUDA 3.2):

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

...

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Used 15 registers, 128+0 bytes smem, 48 bytes cmem[0]

...

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_S_i’ for ‘sm_10’

1>ptxas info : Used 32 registers, 176+16 bytes smem, 92 bytes cmem[1]

...

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_S_i’ for ‘sm_20’

1>ptxas info : Used 32 registers, 128+0 bytes smem, 72 bytes cmem[0], 64 bytes cmem[16]

Does this tell you something? Thanks for the help,

sebgur

And does the register use increase under CUDA 4.0?
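If it does, one quick way to test whether the extra registers are actually what slows you down (just a hypothesis at this point) is to cap the register count, either with nvcc's -maxrregcount option or with __launch_bounds__ on the kernel. A minimal sketch (kernel name and body are placeholders):

    // Tell the compiler the kernel is never launched with more than 256
    // threads per block, so it can budget registers accordingly.
    __global__ void __launch_bounds__(256) myKernel(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];  // placeholder computation
    }

If capping the registers brings back the 3.2 timings, register pressure is the culprit; otherwise the regression is elsewhere.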

Yes, the compilation under 4.0 uses slightly more registers.

Here are samples of the compilation report under 3.2:

1>sobol_gpu.cu

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Used 15 registers, 128+0 bytes smem, 48 bytes cmem[0]

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

1>simplemc_gpu.cu

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_20’

1>ptxas info : Used 27 registers, 128+0 bytes smem, 68 bytes cmem[0], 64 bytes cmem[16]

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_10’

1>ptxas info : Used 25 registers, 176+16 bytes smem, 88 bytes cmem[1]

and the same thing under 4.0 (the order of sm_10 and sm_20 is reversed in 4.0):

1>sobol_gpu.cu

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Function properties for _Z15sobolGPU_kerneljjPjPf

1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

1>ptxas info : Used 16 registers, 128+0 bytes smem, 48 bytes cmem[0]

1>simplemc_gpu.cu

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_10’

1>ptxas info : Used 25 registers, 176+16 bytes smem, 88 bytes cmem[1]

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_20’

1>ptxas info : Function properties for _Z12mcGPU_kernelPfS_jjjPjS_S_f

1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

1>ptxas info : Used 31 registers, 128+0 bytes smem, 68 bytes cmem[0], 76 bytes cmem[16]

Some progress on this subject: at least in my case, most of the speed was lost because I was using exp() instead of expf(), even though everything was in float anyway.

If memory serves, there is a known issue in CUDA 4.0 (on Windows only, I think) where invoking a math library function with a float argument via the generic function name does not map to the single-precision implementation as it should. Sorry for the inconvenience.

The workaround is to invoke the single-precision version explicitly by suffixing the generic function name with 'f' (this is the traditional C way). It seems you already found this fix yourself.
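For example (kernel and variable names here are just for illustration):

    // With float data, call the single-precision function explicitly instead
    // of relying on the generic name being mapped to it.
    __global__ void applyDecay(float* y, const float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // y[i] = exp(-x[i]);   // generic name: may resolve to the double-precision exp()
            y[i] = expf(-x[i]);     // explicit single-precision version
        }
    }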

I received help from one of your colleagues on that. It's impressive how much slower it is when working in double precision.