CUDA 3.2 slower than CUDA 2.0?

Hi,
Two years ago we developed an application for neural network training using CUDA 2.0 on a GTX 295 (using a single GPU).
In the application we mainly use cuBLAS SGEMM for matrix products, plus some custom kernels.
Now we want to update the environment, so we downloaded CUDA 3.2 and recompiled the application code.
We compared the performance of the code compiled with CUDA 2.0 and with CUDA 3.2 on the GTX 295 board, and the problem is that the CUDA 3.2 build is 15% slower than the same code compiled with CUDA 2.0.
The host machine is an HP xw8600 workstation running 64-bit Red Hat.
Do you have any idea how that could happen?

From “Tuning CUDA Applications for Fermi” I understood that the problem could be 32-bit versus 64-bit device code.
The document says:
If you build your application in 64-bit mode (either by passing -m64 to nvcc or by specifying neither -m64 nor -m32 when compiling on a 64-bit machine), e.g., to gain access to more than 4GB of system memory, be aware that nvcc will compile both the host code and the device code in 64-bit mode for devices of compute capability 2.0. While this works, the larger pointers in the device code incur a performance penalty for the device (because of the extra space those pointers occupy in the register file, among other reasons). If you are not targeting GPUs with large amounts of video memory that can take advantage of a 64-bit address space, then this performance penalty is unnecessary. To avoid it, you should separate out the compilation of your host code from your device code and compile the device code in 32-bit mode.

Do you think that could be the problem?
How can I “separate out the compilation of your host code from your device code and compile the device code in 32-bit mode”? Which cuBLAS library should I link against, the one in lib or in lib64?

Thank you for your help.

Hi,

Yes, it could. The same thing can happen even in a pure CPU application.
As a test, you could compile your app in 32-bit mode while using CUDA 3.2.
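For example, a tiny standalone test like the sketch below, built once with -m64 and once with -m32, should show whether device pointer size alone explains part of the gap (just a toy pointer-chasing kernel with made-up sizes; your real kernels and the cuBLAS calls matter more, of course):

// test_bitness.cu -- minimal sketch, file name and sizes are made up.
// Build 64-bit: nvcc -O3 -m64 test_bitness.cu -o test64
// Build 32-bit: nvcc -O3 -m32 test_bitness.cu -o test32   (needs 32-bit host libs installed)
#include <cstdio>
#include <cuda_runtime.h>

// Pointer-heavy toy kernel: each thread walks an index chain.
__global__ void chase(const int *idx, float *out, int n, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = i;
    float acc = 0.0f;
    for (int s = 0; s < steps; ++s) {
        j = idx[j];
        acc += out[j];
    }
    out[i] = acc;
}

int main()
{
    const int n = 1 << 20;
    int   *d_idx;
    float *d_out;
    cudaMalloc((void**)&d_idx, n * sizeof(int));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_idx, 0, n * sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    chase<<<(n + 255) / 256, 256>>>(d_idx, d_out, n, 100);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_idx);
    cudaFree(d_out);
    return 0;
}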

– pium

The problem is that the application must be able to access more than 4 GB of system memory, because the training data are huge, so I cannot compile the host code in 32-bit mode. I’d like to compile the host code in 64-bit mode and the device code in 32-bit mode (including cuBLAS), but I don’t know how to do it. Any idea which nvcc parameters to use?

Was the previous code using CUDA 2.0 compiled in 32-bit or 64-bit mode?
In my opinion, you should change only the CUDA version if you want to compare CUDA performance.
Alternatively, you could try to compile in 64-bit mode with CUDA 2.0 :p

I read somewhere that to compile CUDA code in 32-bit mode inside a 64-bit app you have to use a lower-level API (than CUDA C), but I don’t know more than that.
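I haven’t tried it myself, but I think the lower-level route means the driver API: you would compile the kernel separately with nvcc (e.g. nvcc -m32 -cubin kernel.cu, assuming the toolkit even lets you mix device and host bitness, which is exactly the part I’m not sure about) and then load and launch it from the 64-bit host program, roughly as in the sketch below (file and kernel names are made up):

// host_driver.cpp -- rough sketch of launching a separately compiled kernel
// with the driver API; "kernel.cubin" and "scaleKernel" are hypothetical.
// The kernel should be declared extern "C" __global__ so its name is not mangled.
#include <cstdio>
#include <cuda.h>   // driver API

int main()
{
    cuInit(0);

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the device code that was built in a separate nvcc step.
    CUmodule   mod;
    CUfunction func;
    cuModuleLoad(&mod, "kernel.cubin");
    cuModuleGetFunction(&func, mod, "scaleKernel");

    const int n = 1024;
    CUdeviceptr d_data;
    cuMemAlloc(&d_data, n * sizeof(float));

    // Old-style (pre-cuLaunchKernel) argument setup and launch.
    int offset = 0;
    cuParamSetv(func, offset, &d_data, sizeof(d_data));
    offset += sizeof(d_data);
    cuParamSeti(func, offset, n);
    offset += sizeof(int);
    cuParamSetSize(func, offset);

    cuFuncSetBlockShape(func, 256, 1, 1);
    cuLaunchGrid(func, (n + 255) / 256, 1);
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    printf("done\n");
    return 0;
}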

Yes:

  • Previous code: compiled 64-bit with CUDA 2.0

  • New code: compiled 64-bit with CUDA 3.2

New build performance: roughly 16% slower.

The application makes heavy use of cuBLAS SGEMM and some dedicated kernels.
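For reference, the hot path boils down to a call like the one in the sketch below (simplified, using the old cublas.h API we have used since CUDA 2.0, with made-up matrix sizes and no error checking); timing just this with CUDA events should make the comparison between the two toolkits easy to reproduce:

// sgemm_bench.cu -- simplified sketch of how the SGEMM hot path can be timed;
// matrix sizes are made up, data are left uninitialized (timing only).
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas.h>   // legacy cuBLAS API

int main()
{
    const int m = 2048, n = 2048, k = 2048;
    cublasInit();

    float *d_A, *d_B, *d_C;
    cublasAlloc(m * k, sizeof(float), (void**)&d_A);
    cublasAlloc(k * n, sizeof(float), (void**)&d_B);
    cublasAlloc(m * n, sizeof(float), (void**)&d_C);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // C = 1.0 * A * B + 0.0 * C, column-major, no transposition.
    cublasSgemm('n', 'n', m, n, k, 1.0f, d_A, m, d_B, k, 0.0f, d_C, m);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("SGEMM %dx%dx%d: %f ms\n", m, n, k, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);
    cublasShutdown();
    return 0;
}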

Ah, OK. Then it sounds like your slowdown comes from another problem, because the 64-bit issue is also true for CUDA 2.0.
