How to get rid of __device__ printf() in CUDA 3.1? Looks like it adds 9 registers to kernel require

After upgrading from CUDA 3.0 to CUDA 3.1, I found that my kernel, when compiled for the sm_20 architecture, uses 54 registers instead of the 45 it used under CUDA 3.0. The register count for the sm_13 architecture is unchanged.

It also compiles with the following warnings:
\3.1_64\toolkit\include\common_functions.h(73): warning: dllexport/dllimport conflict with "printf"
1>C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(278): here; dllimport/dllexport dropped

What can be done about this? It looks like device-side printf() requires a lot of resources (as cuPrintf() does). However, it was easy to exclude cuPrintf() from the build and use it only when it was actually needed …

Thanks in advance.
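For reference, the "exclude it from the build unless needed" pattern mentioned above can be sketched for device-side printf() with a preprocessor switch. This is only a minimal sketch: the names ENABLE_DEVICE_PRINTF, DEVICE_PRINTF, scale and scale_on_device are hypothetical and not part of the CUDA toolkit, and nothing in this thread establishes that removing printf() calls this way recovers the extra 9 registers.

```cuda
#include <cstdio>

// Hypothetical switch (not a toolkit macro): pass -DENABLE_DEVICE_PRINTF to nvcc
// only for builds that need device-side output. Device printf() needs sm_20+.
#ifdef ENABLE_DEVICE_PRINTF
#define DEVICE_PRINTF(...) printf(__VA_ARGS__)
#else
#define DEVICE_PRINTF(...) ((void)0)   // call and its arguments are compiled out
#endif

// Illustrative kernel: multiplies each element by a factor.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
        DEVICE_PRINTF("thread %d wrote %f\n", i, data[i]);
    }
}

// Host helper: copies the array to the device, runs the kernel, copies it back.
cudaError_t scale_on_device(float *host, int n, float factor)
{
    float *dev = 0;
    cudaError_t err = cudaMalloc((void **)&dev, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scale<<<1, n>>>(dev, factor, n);   // assumes n fits in one block
        err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}

int main()
{
    float values[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (scale_on_device(values, 4, 2.0f) != cudaSuccess) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 4; ++i) printf("%g\n", values[i]);
    return 0;
}
```

Building the same file with and without -DENABLE_DEVICE_PRINTF, and adding -Xptxas -v to the nvcc command line, makes ptxas report the register count for each variant, so the cost of the printf() calls themselves can be compared directly.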

Are you sure your register usage did not go up because of 64-bit pointers in sm_20 on a 64-bit platform? I noticed that your directory is called 3.1_64 …

Correct, 3.1_64 is the CUDA version. However, I compile with -m32, so the code is 32-bit. Either way, exactly the same code requires 45 registers with CUDA 3.0, not 54.
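The pointer-width question can also be checked empirically rather than inferred from the directory name or build flags. The following is a minimal sketch, assuming a working CUDA device; pointer_width and device_pointer_bytes are hypothetical helper names introduced here for illustration.

```cuda
#include <cstdio>

// Writes the size of a pointer, as seen by the device code, to *out.
__global__ void pointer_width(unsigned int *out)
{
    *out = (unsigned int)sizeof(void *);
}

// Returns the device-side pointer size in bytes, or 0 if a CUDA call fails.
unsigned int device_pointer_bytes()
{
    unsigned int *dev = 0;
    unsigned int host = 0;
    if (cudaMalloc((void **)&dev, sizeof(unsigned int)) != cudaSuccess) return 0;

    pointer_width<<<1, 1>>>(dev);
    if (cudaMemcpy(&host, dev, sizeof(host), cudaMemcpyDeviceToHost) != cudaSuccess)
        host = 0;

    cudaFree(dev);
    return host;
}

int main()
{
    printf("host pointer:   %u bytes\n", (unsigned int)sizeof(void *));
    printf("device pointer: %u bytes\n", device_pointer_bytes());
    return 0;
}
```

If both lines report 4 bytes for the build in question, 64-bit pointers can be ruled out as the source of the additional registers.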