Shared memory gives no speed benefit here: each element is read exactly once, so there is no reuse for it to exploit.
I propose returning either the mean squared error or the maximum error as a success metric, rather than just a flag. It is then up to the caller to decide whether the computation was successful (i.e. within bounds).
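To make the proposal concrete, here is a minimal host-side sketch of what I have in mind. The names `ErrorMetrics` and `compare_results` are hypothetical, and I am assuming the result and reference are available as `float` arrays on the host; adapt as needed to the actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: carries both metrics so the caller can
// pick whichever criterion (and tolerance) suits its use case.
struct ErrorMetrics {
    double mean_squared_error;
    double max_abs_error;
};

// Compare a computed result against a reference.
// Assumption: both vectors have the same length; if they differ,
// only the common prefix is compared.
ErrorMetrics compare_results(const std::vector<float>& result,
                             const std::vector<float>& reference) {
    ErrorMetrics metrics{0.0, 0.0};
    const std::size_t n = std::min(result.size(), reference.size());
    if (n == 0) {
        return metrics;  // nothing to compare, report zero error
    }

    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Accumulate in double to limit rounding error in the sum.
        const double diff = static_cast<double>(result[i]) -
                            static_cast<double>(reference[i]);
        sum_sq += diff * diff;
        metrics.max_abs_error = std::max(metrics.max_abs_error, std::fabs(diff));
    }
    metrics.mean_squared_error = sum_sq / static_cast<double>(n);
    return metrics;
}
```

The caller would then apply its own bound, for example `compare_results(out, ref).max_abs_error < tolerance`, instead of relying on a threshold hard-coded inside the check.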