Pascal (PNY P5000) - enabling ECC slows down memory accesses 20%?

This is the output from CUDA 10 on a PNY P5000 GPU (RHEL 7.1), running the “matrixMul” sample at several sizes and then the bandwidthTest sample.

ECC makes little difference to the matrix multiplication (~1-2%) and almost none to the host-device memory transfers, but in the device-to-device memory transfers the difference is 20%.

Can anyone explain why ECC should make such a difference?

ecc_pny_p5000.log:Performance= 1155.51 GFlop/s, Time= 1.858 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
ecc_pny_p5000.log:Performance= 1220.92 GFlop/s, Time= 14.071 msec, Size= 17179869184 Ops, WorkgroupSize= 1024 threads/block
ecc_pny_p5000.log:Performance= 1231.05 GFlop/s, Time= 111.643 msec, Size= 137438953472 Ops, WorkgroupSize= 1024 threads/block
ecc_pny_p5000.log:Performance= 1220.37 GFlop/s, Time= 900.969 msec, Size= 1099511627776 Ops, WorkgroupSize= 1024 threads/block
ecc_pny_p5000.log: 33554432 11445.3
ecc_pny_p5000.log: 33554432 12499.6
ecc_pny_p5000.log: 33554432 203528.5

no_ecc_pny_p5000.log:Performance= 1183.81 GFlop/s, Time= 1.814 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
no_ecc_pny_p5000.log:Performance= 1228.49 GFlop/s, Time= 13.984 msec, Size= 17179869184 Ops, WorkgroupSize= 1024 threads/block
no_ecc_pny_p5000.log:Performance= 1237.32 GFlop/s, Time= 111.078 msec, Size= 137438953472 Ops, WorkgroupSize= 1024 threads/block
no_ecc_pny_p5000.log:Performance= 1229.63 GFlop/s, Time= 894.182 msec, Size= 1099511627776 Ops, WorkgroupSize= 1024 threads/block
no_ecc_pny_p5000.log: 33554432 11453.8
no_ecc_pny_p5000.log: 33554432 12499.6
no_ecc_pny_p5000.log: 33554432 253295.1
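
For readers who want the relative differences spelled out, here is a minimal Python sketch that computes them from the two logs above. The function name `slowdown_percent` is made up for illustration, and the labels on the three bandwidth lines are an assumption: they follow the order in which bandwidthTest prints its results (host-to-device, device-to-host, device-to-device).

```python
# Figures copied from the two logs above: (ECC on, ECC off).
matrixmul_gflops = [
    (1155.51, 1183.81),
    (1220.92, 1228.49),
    (1231.05, 1237.32),
    (1220.37, 1229.63),
]

# Assumed labels, based on the order bandwidthTest prints its results.
bandwidth_mb_s = {
    "host-to-device": (11445.3, 11453.8),
    "device-to-host": (12499.6, 12499.6),
    "device-to-device": (203528.5, 253295.1),
}


def slowdown_percent(ecc_on, ecc_off):
    """Loss with ECC enabled, as a percentage of the ECC-off figure."""
    return 100.0 * (ecc_off - ecc_on) / ecc_off


for on, off in matrixmul_gflops:
    print(f"matrixMul: {slowdown_percent(on, off):.2f}% slower with ECC")

for name, (on, off) in bandwidth_mb_s.items():
    print(f"{name}: {slowdown_percent(on, off):.2f}% slower with ECC")
```

On these numbers matrixMul loses somewhere between about 0.5% and 2.4%, the host transfers lose essentially nothing, and the device-to-device copy loses just under 20%.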

Generally speaking, matrix multiply is a compute-bound activity, while memory copies are obviously bound by memory throughput.

Unlike CPUs, where ECC information is processed out-of-band over a wider interface (1 additional bit for every 8 bits of data), GPUs process ECC information in-band: the interface remains the same width regardless of whether ECC is disabled or enabled, and shuffling the additional ECC data around takes bandwidth away from user code.
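
A rough way to picture the two schemes, as a deliberately simplified Python sketch. Both function names are hypothetical, and the in-band model (a fixed fraction of raw traffic spent on ECC) is an assumption made for illustration, not a description of how the memory controller actually schedules ECC accesses.

```python
def out_of_band_bus_width(data_bits):
    """Out-of-band ECC: one extra check bit per 8 data bits widens the bus."""
    return data_bits + data_bits // 8


def in_band_user_bandwidth(raw_bandwidth, ecc_traffic_fraction):
    """In-band ECC: the bus stays the same width, so ECC traffic
    is carved out of the bandwidth available to user data."""
    return raw_bandwidth * (1.0 - ecc_traffic_fraction)


# A 64-bit CPU memory channel with out-of-band ECC becomes 72 bits wide.
print(out_of_band_bus_width(64))

# Hypothetical in-band case: 250 units of raw bandwidth, 20% spent on ECC.
print(in_band_user_bandwidth(250.0, 0.20))
```

The point of the comparison is only that the out-of-band scheme pays for ECC in extra wires and extra DRAM chips, while the in-band scheme pays for it in bandwidth and capacity.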

I am a bit surprised that you are observing a 20% reduction in user bandwidth, as I was under the impression that ECC efficiency had been improved in modern GPUs such that the overhead generally is under 10%.

Looking more closely, and assuming I interpret your data correctly, you are observing a 20% degradation only in one specific case, and this may well represent some sort of worst-case scenario.

Would you have a reference that states that NVIDIA GPUs handle ECC “in-band”?

And the bandwidthTest sample may represent a worst case: the fact that ECC-checked GPU memory is being read and written sequentially at the same time may be a cause.

I don’t have a reference handy, and your Google-fu is probably no worse than mine.

Your log is a bit hard to interpret, but it looks to me like you are not observing a 20% degradation for all transfer sizes in bandwidthTest, just a particular one.

This output is just the results from the CUDA sample “bandwidthTest”; the device-to-device case is the one that exhibits the 20% difference between ECC and non-ECC:

$ $HOME/cuda/10/1*/bandwidthTest/bandwidthTest
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Quadro P5000
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11450.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12499.6

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 253154.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Regarding “Google-fu”, I had not been having a lot of luck finding these details.

A recent document for the DRIVE product line (which NVIDIA is targeting at the automotive industry, where ECC may be the difference between a passenger living or dying):

It does confirm your “in-band” description. I’ve been dealing with ECC memories for (oh God, really?) 38 years and this is the first time I’ve found ECC handled “in-band”. So, perhaps, I read this someplace but simply refused to believe it could be true.

And here’s the money quote:

DRAM Bandwidth Impact

When DRAM ECC is enabled, you have an overhead of read/write ECC bytes along with data bytes. This overhead has an impact in DRAM bandwidth when DRAM ECC is enabled. However, this DRAM bandwidth impact with DRAM ECC enabled boot is proportional to the bandwidth consumption. DRAM bandwidth impact of 10%-12% is visible only when the bandwidth consumption is greater than 100 GBps (GigaBytes per second).

So for the sample I’m quoting, the device-to-device bandwidth test is the only one that exceeds 100 GBps.
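
That check is easy to reproduce. The sketch below is only illustrative: the helper names are made up, the 100 GBps threshold comes from documentation written for DRIVE rather than for the P5000, and the conversion assumes 1 GB = 1000 MB (the conclusion is the same if the sample's "MB" means 2^20 bytes).

```python
THRESHOLD_GB_S = 100.0  # figure taken from the DRIVE documentation quoted above

# ECC-off bandwidthTest results from the log, in MB/s as printed.
results_mb_s = {
    "host-to-device": 11453.8,
    "device-to-host": 12499.6,
    "device-to-device": 253295.1,
}


def to_gb_s(mb_s):
    # Assumption: decimal units, 1 GB = 1000 MB.
    return mb_s / 1000.0


def exceeds_threshold(mb_s, threshold_gb_s=THRESHOLD_GB_S):
    """True if the measured bandwidth is above the quoted threshold."""
    return to_gb_s(mb_s) > threshold_gb_s


for name, mb_s in results_mb_s.items():
    flag = "above" if exceeds_threshold(mb_s) else "below"
    print(f"{name}: {to_gb_s(mb_s):.1f} GB/s ({flag} {THRESHOLD_GB_S:.0f} GB/s)")
```

Only the device-to-device copy, at roughly 253 GB/s, lands above the threshold; both PCIe-limited transfers sit near 11-12 GB/s.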

This is the place where new tricks are taught :-)

It has been like that ever since ECC support was added to (some) GPUs a dozen years ago. The reason is presumably that GDDR memory is a much smaller market than DDR, so DRAM manufacturers do not want to make a niche product “GDDR with ECC bits”. There may also be technical issues, such as different internal organization for GDDR that make it harder and therefore costlier to add ECC bits, but that is speculation on my part.

The additional ECC bits for GPUs eat up some of the available RAM attached to the GPU. I think currently that is 6.25%, but I may misremember. Especially in intense read/write scenarios, the bandwidth required to update the ECC bits takes away from user memory bandwidth, as described in the documentation quote you found. And apparently the maximum bandwidth impact of ECC can still be 20% at this time :-(

Even without ECC, in practice one can get only about 80% of theoretical bandwidth on both CPUs and GPUs, as you are likely aware.

Not really relevant to this discussion, but GPUs equipped with HBM2 memory handle ECC out-of-band. For this reason there is little or no performance impact from enabling ECC on e.g. Tesla V100 and P100.

This was the second hit when I googled “gpu dram ecc bandwidth”:

“Like Kepler GK210, the GP100 GPU’s register files, shared memories, L1 and L2 caches, and DRAM are all protected by Single-Error Correct Double-Error Detect (SECDED) ECC code. When enabling ECC support on a Kepler GK210, the available DRAM would be reduced by 6.25% to allow for the storage of ECC bits. Fetching ECC bits for each memory transaction also reduced the effective bandwidth by approximately 20% compared to the same GPU with ECC disabled. HBM2 memories, on the other hand, provide dedicated ECC resources, allowing overhead-free ECC protection.”
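
To put the 6.25% capacity figure in concrete terms, here is a small Python sketch. The function name is hypothetical, the 16 GiB board size is an example value, and applying the GK210 fraction to other GDDR-based boards is an assumption rather than something the quote states.

```python
ECC_CAPACITY_FRACTION = 0.0625  # 6.25% = 1/16, the GK210 figure quoted above


def usable_memory_gib(total_gib, ecc_fraction=ECC_CAPACITY_FRACTION):
    """Memory left for user data once the ECC bits are stored in-band."""
    return total_gib * (1.0 - ecc_fraction)


total_gib = 16.0  # assumed example board size
reserved_gib = total_gib - usable_memory_gib(total_gib)
print(f"usable:   {usable_memory_gib(total_gib):.2f} GiB")
print(f"reserved: {reserved_gib:.2f} GiB")
```

In other words, one sixteenth of the DRAM is set aside for ECC bits, so a 16 GiB board would expose 15 GiB to applications under this assumption.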

PNY Quadro P5000 is not an HBM2 GPU, of course, so its memory characteristics are similar to those of GK210, and not to those of HBM2-equipped Pascal variants such as Tesla P100 and Quadro GP100.

I’m thinking that I just didn’t believe anybody would do it this way, so I just couldn’t understand the answer. I think I understand now.

Thank you for helping me to see what should have been obvious.

I’d gladly give up 20% if the other choice is “crash and burn” (meant quite literally). I’m gratified that the automotive folks have put this on the front burner.

There are many areas outside of automotive where I personally believe ECC to be necessary. As best I know, these days GPUs are used in a lot of medical imaging applications, and I would not want a radiologist to have to guess whether he is looking at a cancerous lesion or an image artifact caused by a cosmic ray knocking out a DRAM bit.

I wasn’t aware that HBM2 provides integrated ECC support, so that’s good knowledge to have.