Code compiled in CUDA 4.0 slower than CUDA 3.1

Hello,

I installed CUDA 4.0 without problems on Debian 6.0 using "cudatoolkit_4.0.13_linux_32_ubuntu10.10.run" and "devdriver_4.0_linux_32_270.40.run". I can compile all the examples in "gpucomputingsdk_4.0.13_linux.run" and they run correctly.

However, when I compile a program that I wrote to simulate propagation governed by the cubic-quintic complex Ginzburg-Landau equation, the code compiled with CUDA 4.0 is slower than the same code compiled with CUDA 3.1 (80 s with CUDA 4.0 versus 50 s with CUDA 3.1).

The Makefile content is:

# Add source files here
EXECUTABLE := cgle2dcuda_sm20

# Cuda source files (compiled with cudacc)
CUFILES_sm_20 := cgle2dcuda_sm20.cu

# Additional compiler flags and LIBs to include
USECUFFT := 1

################################################################################
# Rules and targets
include ../../common/common.mk

I use cufftExecZ2Z to transform to and from Fourier space, both forward and backward, and two kernels to perform the propagation calculations. An example of the code we use is:

    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_FORWARD);
    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_FORWARD);

    propagateData<<<NBLOCK,NHILOS>>>(d_datau, d_datav, N_SIZE, d_D11u, d_D11v, h_c[3]);

    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datau, (cufftDoubleComplex*)d_datau, CUFFT_INVERSE);
    cufftExecZ2Z(plan, (cufftDoubleComplex*)d_datav, (cufftDoubleComplex*)d_datav, CUFFT_INVERSE);

and the propagateData function is:

    static __global__ void propagateData(cufftDoubleComplex* a, cufftDoubleComplex* b, int size,
                                          cuDoubleComplex* D11a, cuDoubleComplex* D11b, double d_c)
    {
        double norm = 1. / (double)size;   // compensate for cuFFT's unnormalized transforms
        double paso = 0.0001;              // step size
        const int numThreads = blockDim.x * gridDim.x;
        const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        // grid-stride loop over all elements of both fields
        for (int i = threadID; i < size; i += numThreads) {
            a[i] = propagateLineal(a[i], D11a[i], paso, d_c, norm);
            b[i] = propagateLineal(b[i], D11b[i], paso, d_c, norm);
        }
    }
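For context, I am not posting propagateLineal itself; a rough sketch of the kind of linear split-step update it performs (assuming an exponential propagator; the exact expression and the role of the h_c[3] coefficient are omitted here) would be:

    #include <cufft.h>      // cufftDoubleComplex
    #include <cuComplex.h>  // cuCreal, cuCimag, cuCmul, make_cuDoubleComplex

    // Sketch only -- not the exact code. Multiply each Fourier coefficient by
    // exp(D * paso) and apply the 1/size normalization that cuFFT leaves to
    // the caller. The d_c coefficient is not used in this simplified version.
    static __device__ cufftDoubleComplex propagateLineal(cufftDoubleComplex u, cuDoubleComplex D,
                                                         double paso, double d_c, double norm)
    {
        (void)d_c;                               // omitted from the sketch
        double growth = exp(cuCreal(D) * paso);  // real part of D: gain/loss
        double phase  = cuCimag(D) * paso;       // imaginary part of D: dispersion
        cuDoubleComplex prop = make_cuDoubleComplex(growth * cos(phase), growth * sin(phase));
        cuDoubleComplex out  = cuCmul(u, prop);  // apply the linear propagator
        return make_cuDoubleComplex(norm * cuCreal(out), norm * cuCimag(out));
    }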

Does anybody know why code compiled with CUDA 4.0 runs slower than code compiled with CUDA 3.1, or what I am doing wrong? I am using a GeForce GTX 470 card.

Thanks.

Hi,

I have the same problem, but under Windows 7. Code compiled under 3.1 runs twice as fast as the same code recompiled under 4.0.

I tried compiling under 4.0 and linking to 3.1 libraries, but that doesn’t seem to help.

Nobody else has observed this?

I tried 3.2 and it's as fast as 3.1, so something happened between 3.2 and 4.0. I hope the guys at NVIDIA will fix it or tell us what we are doing wrong, but for now I'll stay on 3.2.

How many registers do the kernels use?

I’m pretty new to CUDA so I don’t know how to check that. Is there some tool to get that information?

Compile with --ptxas-options=-v added to nvcc's command line.
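For example, something along these lines (the file name and -arch value are just placeholders to adapt to your own project):

    nvcc --ptxas-options=-v -arch=sm_20 -c mykernel.cu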

OK, thanks for the tip. So --ptxas-options=-v gives me the following additional information (under CUDA 3.2):

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

...

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Used 15 registers, 128+0 bytes smem, 48 bytes cmem[0]

...

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_S_i’ for ‘sm_10’

1>ptxas info : Used 32 registers, 176+16 bytes smem, 92 bytes cmem[1]

...

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_S_i’ for ‘sm_20’

1>ptxas info : Used 32 registers, 128+0 bytes smem, 72 bytes cmem[0], 64 bytes cmem[16]

Does this tell you something? Thanks for the help,

sebgur

And does the register use increase under CUDA 4.0?
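If it does, one quick way to test whether the extra registers are actually what slows you down (just a hypothesis at this point) is to cap the register count, either with nvcc's -maxrregcount option or with __launch_bounds__ on the kernel. A minimal sketch (kernel name and body are placeholders):

    // Tell the compiler the kernel is never launched with more than 256
    // threads per block, so it can budget registers accordingly.
    __global__ void __launch_bounds__(256) myKernel(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];  // placeholder computation
    }

If capping the registers brings back the 3.2 timings, register pressure is the culprit; otherwise the regression is elsewhere.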

Yes, the compilation under 4.0 uses slightly more registers.

Here are samples of the compilation report under 3.2:

1>sobol_gpu.cu

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Used 15 registers, 128+0 bytes smem, 48 bytes cmem[0]

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

1>simplemc_gpu.cu

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_20’

1>ptxas info : Used 27 registers, 128+0 bytes smem, 68 bytes cmem[0], 64 bytes cmem[16]

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_10’

1>ptxas info : Used 25 registers, 176+16 bytes smem, 88 bytes cmem[1]

and the same thing under 4.0 (the order of sm_10 and sm_20 is reversed in 4.0):

1>sobol_gpu.cu

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_10’

1>ptxas info : Used 9 registers, 144+16 bytes smem, 20 bytes cmem[1]

1>ptxas info : Compiling entry function ‘_Z15sobolGPU_kerneljjPjPf’ for ‘sm_20’

1>ptxas info : Function properties for _Z15sobolGPU_kerneljjPjPf

1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

1>ptxas info : Used 16 registers, 128+0 bytes smem, 48 bytes cmem[0]

1>simplemc_gpu.cu

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_10’

1>ptxas info : Used 25 registers, 176+16 bytes smem, 88 bytes cmem[1]

1>ptxas info : Compiling entry function ‘_Z12mcGPU_kernelPfS_jjjPjS_S_f’ for ‘sm_20’

1>ptxas info : Function properties for _Z12mcGPU_kernelPfS_jjjPjS_S_f

1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

1>ptxas info : Used 31 registers, 128+0 bytes smem, 68 bytes cmem[0], 76 bytes cmem[16]

Some progress on this subject: at least in my case, most of the speed was lost because I was using exp() instead of expf(), even though everything was in float anyway.

If memory serves, there is a known issue in CUDA 4.0 (on Windows only, I think) where invoking a math library function with a float argument via the generic function name does not map to the single-precision implementation as it should. Sorry for the inconvenience.

The workaround is to invoke the single-precision version explicitly by suffixing the generic function name with 'f' (this is the traditional C way). It seems you already found this fix yourself.
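For example (kernel and variable names here are just for illustration):

    // With float data, call the single-precision function explicitly instead
    // of relying on the generic name being mapped to it.
    __global__ void applyDecay(float* y, const float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // y[i] = exp(-x[i]);   // generic name: may resolve to the double-precision exp()
            y[i] = expf(-x[i]);     // explicit single-precision version
        }
    }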

I received help from one of your colleagues on that. It's impressive how much slower it is when working in double precision.