I have a simple iterator that I’m trying to pass into a scan function (as found in modernGPU or CUB) to do a scatter_if while avoiding writing the result of the scan back to memory.
For some reason it uses far more registers and local memory than a direct implementation of the same functionality. Can anybody point out why? I’ve tested with both CUDA 5.0 and 5.5.
The iterator version compiles to 20 registers and 48 bytes of lmem. The direct implementation uses 5 registers and 0 bytes of lmem.
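For concreteness, here is a stripped-down sketch of the kind of iterator I mean (names and value types are illustrative, not my exact code):

#include <cstddef>

// Output iterator for the scan: assigning the scan result (each
// element's exclusive-sum rank) through the iterator performs the
// scatter directly, so the prefix sums themselves never hit memory.
struct ScatterIfIterator {
    // proxy returned on dereference; operator= consumes the scan value
    struct Ref {
        const int*   flag;
        const float* val;
        float*       out;
        __device__ Ref(const int* f, const float* v, float* o)
            : flag(f), val(v), out(o) {}
        __device__ void operator=(int rank) const {
            if (*flag) out[rank] = *val;   // the scatter_if
        }
    };

    const int*   flags;  // per-element predicate
    const float* in;     // values to compact
    float*       out;    // destination
    size_t       i;      // current position in the scanned sequence

    __host__ __device__ ScatterIfIterator(const int* f, const float* v,
                                          float* o, size_t pos = 0)
        : flags(f), in(v), out(o), i(pos) {}

    __device__ Ref operator*() const { return Ref(flags + i, in + i, out); }
    __device__ Ref operator[](size_t n) const {
        return Ref(flags + i + n, in + i + n, out);
    }
    __host__ __device__ ScatterIfIterator operator+(size_t n) const {
        return ScatterIfIterator(flags, in, out, i + n);
    }
    // (the real code also has the iterator-traits typedefs that the
    // scan implementation expects)
};

The point is that the scan’s output assignment does the scatter in place of a plain store, so the compacted data is the only thing written to memory.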
Depending on how exactly the code is compiled (e.g. debug build or separate compilation), the lmem usage could be for the stack frame of a called function, and the higher register usage could be a side effect of the ABI function call interface. That’s just speculation at this point; there is not enough data to diagnose what might be going on. Is the use of lmem or additional registers actually causing problems?
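If the lmem does turn out to be the stack frame of an un-inlined call, one experiment worth trying (a sketch; the iterator shown is a placeholder, not your code) is to mark the iterator’s operators __forceinline__ so the compiler inlines them at every call site:

struct MyIterator {                                       // placeholder name
    int* p;
    // __forceinline__ forces inlining within a translation unit,
    // eliminating the called function's stack frame and the ABI call
    // overhead (it cannot help across separately compiled units)
    __device__ __forceinline__ int& operator*() const       { return *p; }
    __device__ __forceinline__ int& operator[](int n) const { return p[n]; }
};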
Building code for a 64-bit platform forces all pointers (including those generated by the compiler via strength reduction and induction-variable creation) to 64 bits, so it is not unusual for device code to use more registers when compiled for a 64-bit platform than when compiled for a 32-bit one.
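As a concrete illustration (the kernel itself is made up), every live pointer in a 64-bit build occupies a pair of 32-bit registers, and strength reduction typically materializes additional 64-bit pointer induction variables:

__global__ void axpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // In a 64-bit build, 'x' and 'y' each take a register pair, and the
    // addresses &x[i] and &y[i] the compiler forms are also 64-bit
    // values, so register pressure is higher than in a -m32 build.
    if (i < n) y[i] = a * x[i] + y[i];
}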
In general, I highly recommend using the ABI (it is the compiler default and has to be explicitly disabled by passing -abi=no). Are you getting warnings from the compiler when using -abi=no? I am not up to date on the support status of compiling without the ABI.
Since you are already using CUDA 5.5, filing a bug seems like a good idea to get this sorted out. There should not be any differences between code generated on Windows and on Linux, unless platform-specific type differences come into play, which CUDA inherits for reasons of interoperability with host code (in particular, the type ‘long’ is 32 bits on 64-bit Windows but 64 bits on 64-bit Linux).
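If ‘long’ figures anywhere in the iterator’s index or value types, a quick way to rule that variable out is to switch to fixed-width types (a minimal sketch):

#include <stdint.h>

// 'long' is 32 bits on 64-bit Windows (LLP64) but 64 bits on 64-bit
// Linux (LP64); fixed-width types make the device code identical on
// both platforms.
__global__ void check_sizes(int* out)
{
    out[0] = (int)sizeof(long);      // 4 on Win64, 8 on 64-bit Linux
    out[1] = (int)sizeof(int64_t);   // 8 everywhere
}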