Simple CUDA program hitting size limits/errors on Windows but not Linux

OP mentioned near the start of the thread that they are using debug builds. Without any switches passed to nvcc, the compiler defaults to a fully optimized “release” build.
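For reference (assuming a single-file program called kernel.cu, purely for illustration), the two kinds of build would look roughly like this; it is the -G switch that disables device-code optimization for debugging:

```
# debug build: host (-g) and device (-G) debug info; device optimizations off
nvcc -g -G -o app_debug kernel.cu

# release build: no debug switches; nvcc emits optimized device code by default
nvcc -o app_release kernel.cu
```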

A debug build may well be a decimal order of magnitude slower than a fully-optimized release build, and the lowest-end GPUs in a given architecture are often a decimal order of magnitude slower than the fastest ones. In addition, the Quadro K620 is an older architecture than the GTX 1080 Ti.
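To quantify this on a given machine, one can time the kernel with CUDA events and run the same measurement on both builds and both GPUs. A minimal sketch with a placeholder kernel (not the OP's code):

```
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel, just so the timing harness is self-contained
__global__ void myKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time a single kernel launch with CUDA events
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```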

I don’t think entry-level GPUs are designed for efficiency, i.e. optimized for performance per unit of energy consumed. Rather, they appear to be designed to hit specific price and performance points within a GPU generation, with the hardware cut down accordingly. As an example, for a long time they were still using slow but cheap DDR3 for GPU memory, and I believe this applies to the Quadro K620.

The GT 650M I tested now (results above) has GDDR5, but the same 128-bit bus and core count (384) as the K620.
Unless the recent result from the OP above (4.22 seconds) refers to a debug build, I wouldn’t have expected this gap between DDR3 and GDDR5, since the K620’s theoretical bandwidth is 30 GB/s and the GT 650M’s is 90 GB/s.
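For anyone who wants to check such figures on their own card, the theoretical peak can be computed from the device properties. A small sketch (memoryClockRate is reported in kHz and memoryBusWidth in bits; the factor 2 accounts for the double data rate):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // GB/s = 2 (DDR) * clock [kHz] * bus width [bytes] / 1e6
    double gbps = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6;
    printf("%s: %d-bit bus, %.1f GB/s theoretical peak\n",
           prop.name, prop.memoryBusWidth, gbps);
    return 0;
}
```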

The recent numbers did indeed refer to a debug build.
/Joel

Then it was a pointless comparison.