Two years ago we developed a neural-network training application using CUDA 2.0 on a GTX 295 (using a single GPU).
In the application we mainly use CUBLAS SGEMM for matrix products, plus some custom kernels.
Now we wanted to update the environment, so we downloaded CUDA 3.2 and recompiled the application code.
We compared the performance of the code compiled with CUDA 2.0 against the same code compiled with CUDA 3.2, both on the GTX 295 board, and the problem is that the CUDA 3.2 build is 15% slower than the CUDA 2.0 build.
The host machine is an HP xw8600 workstation running 64-bit Red Hat Linux.
Do you have any idea how this could happen?
From “Tuning CUDA Applications for Fermi” I understood that the problem could be 32-bit versus 64-bit device code.
The document says:
If you build your application in 64-bit mode (either by passing -m64 to nvcc or by specifying neither -m64 nor -m32 when compiling on a 64-bit machine), e.g., to gain access to more than 4GB of system memory, be aware that nvcc will compile both the host code and the device code in 64-bit mode for devices of compute capability 2.0. While this works, the larger pointers in the device code incur a performance penalty for the device (because of the extra space those pointers occupy in the register file, among other reasons). If you are not targeting GPUs with large amounts of video memory that can take advantage of a 64-bit address space, then this performance penalty is unnecessary. To avoid it, you should separate out the compilation of your host code from your device code and compile the device code in 32-bit mode.
Do you think that could be the problem?
How do I “separate out the compilation of your host code from your device code and compile the device code in 32-bit mode”? And which CUBLAS library should I link against, lib or lib64?
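For what it's worth, my reading of that passage is that the build would be split roughly as follows. This is only a sketch: the file names and install paths are hypothetical, and as far as I understand, loading 32-bit device code from a 64-bit host process requires going through the driver API (cuModuleLoad) rather than the runtime's automatic kernel launch:

```shell
# Compile ONLY the device code, in 32-bit mode, to a standalone cubin
# (kernels.cu and the sm_13 target for the GTX 295 are assumptions)
nvcc -m32 -arch=sm_13 -cubin -o kernels.cubin kernels.cu

# Compile the host code in 64-bit mode and link the 64-bit CUDA libraries
g++ -m64 -I/usr/local/cuda/include -c main.cpp
g++ -m64 -o app main.o -L/usr/local/cuda/lib64 -lcudart -lcublas -lcuda

# At run time, the host code loads kernels.cubin via the driver API
# (cuModuleLoad / cuModuleGetFunction) instead of the <<<...>>> syntax.
```

If this is right, it would also answer the library question: a 64-bit host build links against lib64, regardless of how the device code is compiled.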
Thank you for your help.