OP mentioned near the start of the thread that they are using debug builds. Without any switches passed to nvcc, the compiler defaults to fully optimized device code, i.e. a “release” build; it is the -G switch used for debug builds that turns those optimizations off.
A debug build may well be a decimal order of magnitude slower than a fully optimized release build, and the lowest-end GPUs in a given architecture are often a decimal order of magnitude slower than the fastest ones. Those two factors multiply, so a gap on the order of 100x is plausible from them alone. On top of that, the Quadro K620 is based on an older architecture (Maxwell) than the GTX 1080 Ti (Pascal).
I don’t think the entry-level GPUs are designed for efficiency, i.e. for maximizing performance per unit of energy consumed. Rather, they appear to be designed to hit specific price and performance points within a GPU generation, with the hardware cut down accordingly. As an example, for a long time entry-level parts were still using slow but cheap DDR3 for GPU memory, and I think this applies to the Quadro K620.
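
To put a rough number on the DDR3 point, here is a back-of-the-envelope sketch of theoretical peak memory bandwidth, computed as bus width in bytes times per-pin data rate. The bus widths and data rates are what I recall from the published spec sheets for the two cards, not measurements, so treat them as assumptions; the function and variable names are just for illustration.

```python
# Back-of-the-envelope peak memory bandwidth from bus width and per-pin data rate.
# The spec figures below are recalled from published data sheets (assumptions, not measurements).

def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbit_s: float) -> float:
    """Theoretical peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbit_s

# Quadro K620: 128-bit DDR3 at about 1.8 Gbit/s per pin (assumed)
k620 = peak_bandwidth_gb_s(128, 1.8)
# GTX 1080 Ti: 352-bit GDDR5X at about 11 Gbit/s per pin (assumed)
gtx1080ti = peak_bandwidth_gb_s(352, 11.0)

print(f"Quadro K620: {k620:.1f} GB/s")
print(f"GTX 1080 Ti: {gtx1080ti:.1f} GB/s")
print(f"ratio:       {gtx1080ti / k620:.1f}x")
```

If those spec-sheet figures are right, the memory bandwidth gap alone comes out at roughly 17x, so a memory-bound kernel would already see about an order of magnitude difference between the two cards regardless of build settings.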