Why would code run 1.7x faster when run with nvprof than without?

I don’t have a good hypothesis for what could be going on with respect to the GPU. It’s possible that the profiler introduces additional host/device synchronizations, and that these happen to improve some hardware utilization imbalance, but I would expect that to be a minor effect at most on a Linux platform.

My best guess for now is that the time difference occurs in a non-GPU portion of the code. It will be interesting to hear what you track the additional time down to. A curious head-scratcher for sure!
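One way to narrow this down is to accumulate wall-clock time per phase of the application and compare the numbers from a run with nvprof against a run without it. Here is a minimal host-side sketch in plain C++, not taken from your code: `PhaseTimer`, `g_phase_seconds`, `time_example_phases`, and the phase names are all hypothetical, and the sleep only stands in for real host work. Kernel launches are asynchronous, so a phase that launches GPU work would need a `cudaDeviceSynchronize()` before its timer goes out of scope, otherwise the GPU time gets charged to whichever later phase happens to block.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <thread>
#include <utility>

// Accumulated wall-clock seconds per phase. Phase names are hypothetical.
static std::map<std::string, double> g_phase_seconds;

// RAII timer: adds the time between construction and destruction
// to the running total for the named phase.
class PhaseTimer {
public:
    explicit PhaseTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

    ~PhaseTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        g_phase_seconds[name_] += elapsed.count();
    }

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Example of wrapping two phases of an application.
void time_example_phases() {
    {
        PhaseTimer t("host_setup");
        // Stand-in for real host-side work (file I/O, allocation, setup).
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    {
        PhaseTimer t("gpu_section");
        // In a CUDA build: launch kernels here, then call
        // cudaDeviceSynchronize() before this scope ends so that the
        // asynchronous GPU work is counted in this phase.
    }
}

// Print one line per phase with its accumulated time.
void print_phase_report() {
    for (const auto& entry : g_phase_seconds) {
        std::printf("%-12s %.6f s\n", entry.first.c_str(), entry.second);
    }
}
```

Calling `time_example_phases()` followed by `print_phase_report()` from `main` in both configurations should show which phase accounts for the difference. If the GPU phase takes about the same time in both runs, the extra time is on the host side.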