FROM THE NVIDIA “Tuning CUDA Applications for Fermi” MANUAL:
“32-bit versus 64-bit Device Code
If you build your application in 64-bit mode (either by passing -m64 to nvcc or by specifying neither -m64 nor -m32 when compiling on a 64-bit machine), e.g., to gain access to more than 4GB of system memory, be aware that nvcc will compile both the host code and the device code in 64-bit mode for devices of compute capability 2.0. While this works, the larger pointers in the device code incur a performance penalty for the device (because of the extra space those pointers occupy in the register file, among other reasons). If you are not targeting GPUs with large amounts of video memory that can take advantage of a 64-bit address space, then this performance penalty is unnecessary. To avoid it, you should separate out the compilation of your host code from your device code and compile the device code in 32-bit mode.”
How do I do the above? Could someone show a simple example?
In my experience, the above does not work.
The .cu files are included as if they were headers, and the object files appear to be intermediate in nature. How can the device code be built with -m32, kept, and later linked with the 64-bit CPU code?
Thank you for your help.