As far as I understand the compilation process, tera’s explanation is right on the money. As an addendum, one reason __CUDA_ARCH__ is undefined in host code is that, for fatbinary compilation targeting multiple device architectures, the host code is compiled only once, so it can’t be associated with any particular CUDA architecture.
The recommended way to check for the CUDA architecture in device code is something like this:
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)
In general, CUDA architecture versions follow an onion-layer model, in which each newer architecture keeps the features of the older ones, so the use of architectural features is usually best guarded by >= comparisons against __CUDA_ARCH__.
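To make this concrete, here is a minimal sketch of how such a guard might be wrapped up. The names HOST_DEVICE, compiled_arch, has_sm20_features and report_arch are made up for illustration and are not part of the CUDA toolkit. The file is written so that it also builds with a plain host compiler, in which case only the host branches are taken.

```cpp
// HOST_DEVICE is a hypothetical convenience macro so the same source
// builds with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Returns the value of __CUDA_ARCH__ for the device pass that compiled
// this call, or 0 in the host pass, where the macro is undefined.
HOST_DEVICE inline int compiled_arch() {
#if defined(__CUDA_ARCH__)
    return __CUDA_ARCH__;
#else
    return 0;
#endif
}

// Onion-layer style guard: true for any device pass targeting sm_20 or
// newer, false for older targets and for the host pass.
HOST_DEVICE inline bool has_sm20_features() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)
    return true;
#else
    return false;
#endif
}

#ifdef __CUDACC__
// Hypothetical kernel: writes the architecture value this particular
// device image was compiled for.
__global__ void report_arch(int *out) {
    *out = compiled_arch();
}
#endif
```

Called from host code, compiled_arch() returns 0 no matter how many -gencode targets were given, because there is only the one host pass. Called from a kernel, the result depends on which device image from the fatbinary the driver selected.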