cudaMemcpy latency unusually high on some machines

when people are fiddling with WDDM performance, I often suggest to try both settings of Hardware Accelerated GPU Scheduling, see here for a recent example. I don’t know if it would have any effect here.

The low level performance tool I would suggest to possibly get more understanding would be nsight systems.