64-bit versus 32-bit CUDA code: any benefit at all?

I am currently developing an application in a 64-bit Windows 7 environment.
When I generate 64-bit code, .cpp files are processed by the Visual C++ compiler and .cu files go to nvcc (with the -m64 option).

Obviously everything works fine, but in the FAQ (and some other sources) we can read that:
“Current GPUs are essentially 32-bit machines, but they do support 64 bit integers (long longs). Operations on these types compile to multiple instruction sequences.”

This suggests the following question:
Q1: Is it true that 64-bit code produced by nvcc will be needlessly slower and will consume more resources (in particular, registers) than if I used the -m32 option, for example when dealing with pointers?

That is why I tried compiling the .cpp files into 64-bit code while running nvcc with the -m32 option. That led to an obvious error from the linker:
“fatal error LNK1112: module machine type ‘X86’ conflicts with target machine type ‘x64’”
I would prefer not to convert my whole application to a 32-bit version, which is an obvious yet unsatisfactory workaround.
Q2: Any ideas on how to launch 32-bit kernels from 64-bit host program?

One idea of mine is to split the code into two parts:
a) Pure device code (__global__ and __device__ functions)
b) Pure host code which invokes the kernels using the runtime API.

Code (a) would be compiled into 32-bit code, and the __global__ function pointers would be made visible to the linker as 64-bit unsigned integers with the high part of the pointer set to 0.
Code (b) would be compiled into 64-bit code. I can allocate device memory here and prepare for the launch. Upon launch, all pointer parameters would be demoted to 32-bit pointers. This obviously assumes that the high part of every pointer is always 0.
Q3: Is it the case that all device pointers (data and function pointers) can be expressed as 32-bit integers, even if created by a 64-bit application?
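
To make the idea concrete, here is a rough sketch of the demotion step I have in mind (purely illustrative; it relies entirely on the unverified assumption from Q3 that the high 32 bits of every device pointer are zero):

#include <assert.h>
#include <stddef.h>

// Hypothetical helper: squeeze a 64-bit device pointer into 32 bits so it
// could be handed to a kernel built as 32-bit code. Stands or falls on the
// assumption that the high half of the pointer is always 0.
unsigned int demotePointer(void *devicePtr)
{
	unsigned long long wide = (unsigned long long)(size_t)devicePtr;
	assert((wide >> 32) == 0);   // the whole scheme depends on this
	return (unsigned int)wide;   // value a 32-bit kernel would receive
}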

Finally, a small question. I need an integer type (“native int”) whose size matches the kind of code being produced: for 32-bit it is an int, for 64-bit it is a long long. A simple way is to just typedef it somewhere, but:
Q4: Is there a predefined macro in nvcc which tells me if currently produced code is 32-bit or 64-bit?
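
For now I just do the typedef by hand, keyed off the host compiler's width macro. This is only a sketch and assumes that the _WIN64 define from the 64-bit Visual C++ build is also visible when nvcc processes the .cu file:

// Hand-rolled "native int" typedef; assumes _WIN64 is defined for 64-bit
// builds (as it is for the Visual C++ compiler) and also seen by nvcc.
#ifdef _WIN64
typedef long long          native_int;
typedef unsigned long long native_uint;
#else
typedef int                native_int;
typedef unsigned int       native_uint;
#endif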

You might find it worthwhile to read the nvcc manual. The key thing to understand is that nvcc isn’t a compiler; it is a compiler driver which orchestrates a sequence of compile steps using both the host compiler (gcc or Visual C++) and nVidia’s port of the Open64 compiler, nvopencc, which compiles the device code. I think I am right in saying that there is no such thing as 64-bit device code on present architectures. When you tell nvcc to compile 64-bit code, you are really telling it to launch 64-bit compilation for the host compiler, which includes selecting 64-bit-safe versions of the host-side API functions (and probably using the 64-bit version of the glue that nVidia uses to encapsulate device assembler inside a host ELF or DLL file). When you see host architecture conflicts during linking, it is because the -m32 option has brought 32-bit host code into the object files holding the device code, not the device code itself.

The simple rule of thumb is to build everything either 32-bit or 64-bit, depending on host-side factors only. It has no effect on the device code that is produced.

Yes, I know that nvcc is not a compiler itself; that was my simplification to state the problem I am facing. I will be more than happy to use nvcc with the -m64 option if I know that the compiled kernels will be exactly the same… However…

I wish that were true. Unfortunately, sizeof(int*) returns 8 on the device side when compiled with -m64 and returns 4 when compiled with -m32.

I assume that sizeof() correctly reports the sizes of types on the device side. The size of pointers matters, for example, when passing parameters to the kernel.

Try the following:

#include <cstdio>

__global__ void getMeSize(int *result) {
	// sizeof() is evaluated in device code here, so this reports the
	// device-side pointer width
	result[0] = sizeof(void*);
}

int main() {
	int *p;
	int h;
	cudaMalloc((void**)&p, sizeof(int));
	getMeSize<<<1,1>>>(p);
	cudaMemcpy(&h, p, sizeof(int), cudaMemcpyDeviceToHost);
	cudaFree(p);
	printf("Size is %d\n", h);
	return 0;
}

My output is:

Size is 8

Also note that compiling with --ptxas-options=-v shows a difference in the size of the kernel parameters (which are stored in shared memory):

64-bit:

1>ptxas info	: Compiling entry function '_Z9getMeSizePi'
1>ptxas info	: Used 2 registers, 8+16 bytes smem

32-bit:

1>ptxas info	: Compiling entry function '_Z9getMeSizePi'
1>ptxas info	: Used 2 registers, 4+16 bytes smem

Pointer representations on the device are guaranteed to be the same size as on the host: if you compile with -m64, then sizeof(int*) will be eight bytes. However, GT200 is still a 32-bit architecture, so only four bytes are actually used internally to represent the pointer.

Basically the whole point is so you can pass a pointer from the host to the device and have it Just Work. Doing goofy -m32 stuff is doable (with the driver API) but not recommended.

Yep, if you look at some of the generated code, -m32 will map all kernel pointer parameters to .u32 and -m64 will map them to .u64.

OK, thank you for clarifying it for me.
After some more testing I learned that __global__ function pointers differ significantly depending on the target mode (32 or 64) and cannot simply be mapped onto one another.
Since I have absolutely no idea how to tell the nvcc machinery to compile __global__ functions into 32-bit code but then return a 64-bit pointer to them (the idea itself seems stupid, does it not? ;) ), I will follow your advice not to do the “goofy -m32 stuff”.

I wonder if next generation of GPUs will be 64-bit or not…

The last question remains:
Q4: Is there a predefined macro in nvcc/nvopencc which tells me whether the code currently being produced is 32-bit or 64-bit?
I checked the nvcc documentation but have seen only __CUDACC__ and __CUDA_ARCH__.
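
Failing a dedicated macro, one workaround sketch that should track the -m32/-m64 mode automatically is to derive the type from the pointer size at compile time. This is just an idea of mine, not an official nvcc feature:

#include <stddef.h>

// Select the "native int" from sizeof(void*), so it follows whatever
// pointer width the translation unit is compiled with.
template <size_t PtrSize> struct NativeIntImpl;
template <> struct NativeIntImpl<4> { typedef int       type; };
template <> struct NativeIntImpl<8> { typedef long long type; };

typedef NativeIntImpl<sizeof(void*)>::type native_int;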