Question about NVIDIA's OpenCL example from the OpenCL Programming Overview whitepaper

I have a question about the example code (the last example in the OpenCL Programming Overview whitepaper, which is the same as the last kernel in MatVecMul from the OpenCL examples): the kernel with warp synchronization.

There is a line in the kernel code that I don't fully understand:

__local float* p = partialDotProduct + 2 * get_local_id(0) - id;

If we have, for example, a work-group of 256 threads and partialDotProduct has the same size, then the code above makes p point past the end of partialDotProduct. For example, partialDotProduct + 2*255 - 32 = &partialDotProduct[478], which is out of range because the buffer has only 256 elements. Is that OK? What happens to the memory: isn't it overwritten? Or are the threads that would compute out-of-range elements simply not executed? Can you explain it to me? Thanks in advance.
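
To make the concern concrete, here is a small host-side sketch in plain C that reproduces the pointer arithmetic for each work-item. It rests on an assumption that is not shown in the line quoted above: that id is the work-item's index within a 32-wide warp, i.e. get_local_id(0) & (WARP_SIZE - 1). Under that assumption the last work-item has id 31 rather than 32, so its offset would be 479. The names p_offset and first_out_of_range are hypothetical helpers, not part of the NVIDIA sample.

```c
#include <stddef.h>

/* Assumption: warps are 32 work-items wide, as on NVIDIA GPUs. */
#define WARP_SIZE 32u

/* Element offset of p relative to partialDotProduct for one work-item.
   Assumes id = local_id & (WARP_SIZE - 1); the post does not show how
   id is actually defined in the kernel. */
static unsigned p_offset(unsigned local_id)
{
    unsigned id = local_id & (WARP_SIZE - 1u);
    return 2u * local_id - id;
}

/* Smallest local id whose offset falls at or beyond buffer_len.
   Returns group_size if every work-item stays in range. */
static unsigned first_out_of_range(unsigned group_size, unsigned buffer_len)
{
    for (unsigned lid = 0; lid < group_size; ++lid) {
        if (p_offset(lid) >= buffer_len)
            return lid;
    }
    return group_size;
}
```

Calling first_out_of_range(256, 256) reports the first work-item whose p would start outside a 256-element buffer, which shows how many work-items would have to be excluded from executing that line for all accesses to stay in bounds.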