I'm seeing log entries like the following in journalctl:
NVRM: Xid (PCI:0000:48:00): 48, pid=‘’, name=, An uncorrectable double bit error (DBE) has been detected on GPU in the L2 cache at cache 6, slice 1.
I've been trying to understand what the different cache and slice numbers in these messages mean and how they map onto the way an H100 organizes its L2 cache. Most of what I've read that seems relevant refers to the Ampere architecture whitepaper, which states that each L2 partition has 40 L2 cache slices. The NVIDIA material on the Hopper architecture that I've found through search doesn't include L2 cache slice details.
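For anyone who wants to compare which cache/slice pairs show up across many of these entries, here is a rough sketch of how they could be tallied. It assumes every entry follows the exact wording of the sample line above; the regex and the `tally_l2_errors` helper are my own illustration, not an NVIDIA tool or a documented log format.

```python
import re
from collections import Counter

# Pattern built only from the single sample line above; the real
# message format may vary between driver versions (assumption).
XID_L2_DBE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r".*?L2 cache at cache (?P<cache>\d+), slice (?P<slice>\d+)"
)


def tally_l2_errors(lines):
    """Count L2 DBE reports per (PCI address, cache, slice) tuple."""
    counts = Counter()
    for line in lines:
        match = XID_L2_DBE.search(line)
        if match:
            key = (match["pci"], int(match["cache"]), int(match["slice"]))
            counts[key] += 1
    return counts


# The sample entry from this post, used as test input.
sample = (
    "NVRM: Xid (PCI:0000:48:00): 48, pid=‘’, name=, An uncorrectable "
    "double bit error (DBE) has been detected on GPU in the L2 cache "
    "at cache 6, slice 1."
)

for key, count in tally_l2_errors([sample]).items():
    print(key, count)
```

Feeding it the saved output of `journalctl -k` line by line would show whether the errors cluster on one cache/slice pair or are spread across several.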
I would appreciate any insight.