Host code from .cu files is preprocessed by nvcc before it is passed to the host compiler, and the preprocessed host code seen by the host compiler will tend to differ from the same host code presented directly to the host compiler. In my experience any performance differences resulting from these source differences are minor, but I would not rule out more pronounced differences in some cases.
You could simply move the affected host code from the .cu file to a separate .cpp file that is compiled directly with the host compiler. You can also try passing additional optimization flags to the host compiler via the nvcc command line. For example, if the host compiler is g++, you could try something like this:
-Xcompiler -O3 -Xcompiler -march=core2 -Xcompiler -mtune=core2 -Xcompiler -msse2
Which additional host compiler optimization flags make sense will depend on your code, the target platform, and the host toolchain.
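To illustrate the first suggestion (moving host code into its own .cpp file), here is a minimal sketch. The file names host_work.cpp and kernels.cu, the function sum_of_squares, and the build commands in the comments are all hypothetical placeholders of mine, not taken from your project; adapt them to whatever host code is actually running slower.

```cpp
// host_work.cpp (hypothetical file name): host-only code moved out of the .cu file.
// It is compiled directly by the host compiler, so nvcc never preprocesses it.
// One possible build sequence, assuming g++ as the host compiler:
//   g++  -O3 -c host_work.cpp -o host_work.o
//   nvcc -c kernels.cu -o kernels.o
//   nvcc host_work.o kernels.o -o app
#include <cstddef>
#include <vector>

// The .cu file only needs this declaration (normally kept in a shared header)
// and then calls the function like any other host function.
double sum_of_squares(const std::vector<double>& v);

// Example CPU-side loop; it stands in for the performance-sensitive host code.
double sum_of_squares(const std::vector<double>& v)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

With this split, the host function is built with exactly the flags you give g++, which also makes a timing comparison against the nvcc-built version straightforward.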
I assume you already verified that the flags passed to the host compiler are either identical, or at least essentially the same, between the nvcc build and the separate host compiler build? Adding the -v switch to the nvcc command line will cause it to show exactly how each underlying tool is invoked.