CUDA 3.2 slower than CUDA 2.0?

Hi,
Two years ago we developed an application for neural network training using CUDA 2.0 on a GTX 295 (using a single GPU).
In the application we mainly use cuBLAS SGEMM for matrix products, plus some custom kernels.
Now we want to update the environment, so we downloaded CUDA 3.2 and recompiled the application code.
We compared the performance of the code compiled with CUDA 2.0 and with CUDA 3.2 on the GTX 295 board, and the problem is that the CUDA 3.2 build is 15% slower than the same code compiled with CUDA 2.0.
The host machine is an HP xw8600 workstation running 64-bit Red Hat.
Do you have any idea how that could happen?

From “Tuning CUDA Applications for Fermi” I understood that the problem could be 32-bit versus 64-bit device code.
The document says:
If you build your application in 64-bit mode (either by passing -m64 to nvcc or by specifying neither -m64 nor -m32 when compiling on a 64-bit machine), e.g., to gain access to more than 4GB of system memory, be aware that nvcc will compile both the host code and the device code in 64-bit mode for devices of compute capability 2.0. While this works, the larger pointers in the device code incur a performance penalty for the device (because of the extra space those pointers occupy in the register file, among other reasons). If you are not targeting GPUs with large amounts of video memory that can take advantage of a 64-bit address space, then this performance penalty is unnecessary. To avoid it, you should separate out the compilation of your host code from your device code and compile the device code in 32-bit mode.

Do you think that could be the problem?
How can I “separate out the compilation of your host code from your device code and compile the device code in 32-bit mode”? Which cuBLAS library should I link against, the one in lib or in lib64?

Thank you for your help.

Hi,

Yes, it could. The same thing can happen even in a pure CPU application.
As a test, you could compile your app in 32-bit mode while using CUDA 3.2.
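For example, a tiny standalone test like the sketch below, built once with -m64 and once with -m32, should show whether device pointer size alone explains part of the gap (just a toy pointer-chasing kernel with made-up sizes; your real kernels and the cuBLAS calls matter more, of course):

// test_bitness.cu -- minimal sketch, file name and sizes are made up.
// Build 64-bit: nvcc -O3 -m64 test_bitness.cu -o test64
// Build 32-bit: nvcc -O3 -m32 test_bitness.cu -o test32   (needs 32-bit host libs installed)
#include <cstdio>
#include <cuda_runtime.h>

// Pointer-heavy toy kernel: each thread walks an index chain.
__global__ void chase(const int *idx, float *out, int n, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = i;
    float acc = 0.0f;
    for (int s = 0; s < steps; ++s) {
        j = idx[j];
        acc += out[j];
    }
    out[i] = acc;
}

int main()
{
    const int n = 1 << 20;
    int   *d_idx;
    float *d_out;
    cudaMalloc((void**)&d_idx, n * sizeof(int));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_idx, 0, n * sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    chase<<<(n + 255) / 256, 256>>>(d_idx, d_out, n, 100);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_idx);
    cudaFree(d_out);
    return 0;
}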

– pium

The problem is that the application must be able to access more than 4 GB of system memory, because the training data are huge, so I cannot compile the host code in 32-bit mode. I’d like to compile the host code in 64-bit mode and the device code in 32-bit mode (including cuBLAS), but I don’t know how to do it. Any idea which nvcc parameters to use?

Was the previous code using CUDA 2.0 compiled in 32-bit or 64-bit mode?
In my opinion, you should change only the CUDA version if you want to compare CUDA performance.
Alternatively, you could try to compile in 64-bit mode with CUDA 2.0 :p

I read somewhere that to compile CUDA code in 32-bit mode inside a 64-bit app you have to use a lower-level API (than CUDA C), but I don’t know more than that.
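I haven’t tried it myself, but I think the lower-level route means the driver API: you would compile the kernel separately with nvcc (e.g. nvcc -m32 -cubin kernel.cu, assuming the toolkit even lets you mix device and host bitness, which is exactly the part I’m not sure about) and then load and launch it from the 64-bit host program, roughly as in the sketch below (file and kernel names are made up):

// host_driver.cpp -- rough sketch of launching a separately compiled kernel
// with the driver API; "kernel.cubin" and "scaleKernel" are hypothetical.
// The kernel should be declared extern "C" __global__ so its name is not mangled.
#include <cstdio>
#include <cuda.h>   // driver API

int main()
{
    cuInit(0);

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the device code that was built in a separate nvcc step.
    CUmodule   mod;
    CUfunction func;
    cuModuleLoad(&mod, "kernel.cubin");
    cuModuleGetFunction(&func, mod, "scaleKernel");

    const int n = 1024;
    CUdeviceptr d_data;
    cuMemAlloc(&d_data, n * sizeof(float));

    // Old-style (pre-cuLaunchKernel) argument setup and launch.
    int offset = 0;
    cuParamSetv(func, offset, &d_data, sizeof(d_data));
    offset += sizeof(d_data);
    cuParamSeti(func, offset, n);
    offset += sizeof(int);
    cuParamSetSize(func, offset);

    cuFuncSetBlockShape(func, 256, 1, 1);
    cuLaunchGrid(func, (n + 255) / 256, 1);
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    printf("done\n");
    return 0;
}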

Yes:

  • Previous code: compiled 64-bit with CUDA 2.0

  • New code: compiled 64-bit with CUDA 3.2

New build performance: roughly 16% slower.

The application makes heavy use of cuBLAS SGEMM and some dedicated kernels.
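For reference, the hot path boils down to a call like the one in the sketch below (simplified, using the old cublas.h API we have used since CUDA 2.0, with made-up matrix sizes and no error checking); timing just this with CUDA events should make the comparison between the two toolkits easy to reproduce:

// sgemm_bench.cu -- simplified sketch of how the SGEMM hot path can be timed;
// matrix sizes are made up, data are left uninitialized (timing only).
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas.h>   // legacy cuBLAS API

int main()
{
    const int m = 2048, n = 2048, k = 2048;
    cublasInit();

    float *d_A, *d_B, *d_C;
    cublasAlloc(m * k, sizeof(float), (void**)&d_A);
    cublasAlloc(k * n, sizeof(float), (void**)&d_B);
    cublasAlloc(m * n, sizeof(float), (void**)&d_C);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // C = 1.0 * A * B + 0.0 * C, column-major, no transposition.
    cublasSgemm('n', 'n', m, n, k, 1.0f, d_A, m, d_B, k, 0.0f, d_C, m);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("SGEMM %dx%dx%d: %f ms\n", m, n, k, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);
    cublasShutdown();
    return 0;
}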

Ah, OK. Then it sounds like your slowdown comes from another problem, because the 64-bit issue is also true for CUDA 2.0.
