This is the output from CUDA 10 on a PNY P5000 GPU (RHEL 7.1) running the matrixMul sample at several sizes, followed by the bandwidthTest sample, once with ECC enabled and once with it disabled.

ECC makes only a small difference to the matrix multiplication (about 0.5% to 2.4%, depending on size) and almost none to the host-device memory transfers, but the device-to-device bandwidth drops by about 20% with ECC on (203528.5 vs. 253295.1, the last bandwidthTest line of each log).

Can anyone explain why ECC should make such a difference?

ecc_pny_p5000.log:Performance= 1155.51 GFlop/s, Time= 1.858 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block

ecc_pny_p5000.log:Performance= 1220.92 GFlop/s, Time= 14.071 msec, Size= 17179869184 Ops, WorkgroupSize= 1024 threads/block

ecc_pny_p5000.log:Performance= 1231.05 GFlop/s, Time= 111.643 msec, Size= 137438953472 Ops, WorkgroupSize= 1024 threads/block

ecc_pny_p5000.log:Performance= 1220.37 GFlop/s, Time= 900.969 msec, Size= 1099511627776 Ops, WorkgroupSize= 1024 threads/block

ecc_pny_p5000.log: 33554432 11445.3

ecc_pny_p5000.log: 33554432 12499.6

ecc_pny_p5000.log: 33554432 203528.5

no_ecc_pny_p5000.log:Performance= 1183.81 GFlop/s, Time= 1.814 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block

no_ecc_pny_p5000.log:Performance= 1228.49 GFlop/s, Time= 13.984 msec, Size= 17179869184 Ops, WorkgroupSize= 1024 threads/block

no_ecc_pny_p5000.log:Performance= 1237.32 GFlop/s, Time= 111.078 msec, Size= 137438953472 Ops, WorkgroupSize= 1024 threads/block

no_ecc_pny_p5000.log:Performance= 1229.63 GFlop/s, Time= 894.182 msec, Size= 1099511627776 Ops, WorkgroupSize= 1024 threads/block

no_ecc_pny_p5000.log: 33554432 11453.8

no_ecc_pny_p5000.log: 33554432 12499.6

no_ecc_pny_p5000.log: 33554432 253295.1
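
For anyone who wants to check the percentages, here is a small Python sketch that recomputes them from the numbers in the log lines above. The labels on the three bandwidthTest lines are my assumption, based on the order in which the sample normally prints its results (host to device, device to host, device to device). The helper name ecc_penalty_pct is just something made up for this post, and the penalty is taken relative to the ECC-off figure.

```python
# Values copied from the log lines above as (ECC on, ECC off) pairs.
matrixmul_gflops = [
    (1155.51, 1183.81),
    (1220.92, 1228.49),
    (1231.05, 1237.32),
    (1220.37, 1229.63),
]

# bandwidthTest figures; the labels assume the sample's usual output order.
bandwidth = {
    "host to device": (11445.3, 11453.8),
    "device to host": (12499.6, 12499.6),
    "device to device": (203528.5, 253295.1),
}


def ecc_penalty_pct(ecc_on, ecc_off):
    """Percentage lost with ECC enabled, relative to the ECC-off figure."""
    return 100.0 * (ecc_off - ecc_on) / ecc_off


matrixmul_penalties = [
    round(ecc_penalty_pct(on, off), 2) for on, off in matrixmul_gflops
]
bandwidth_penalties = {
    name: round(ecc_penalty_pct(on, off), 2)
    for name, (on, off) in bandwidth.items()
}

print("matrixMul penalty (%):", matrixmul_penalties)
print("bandwidthTest penalty (%):", bandwidth_penalties)
```

The matrixMul penalty shrinks as the problem size grows, while the device-to-device figure is the only one that moves by a double-digit percentage.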