Your evaluate() function probably performs an out-of-bounds shared memory access, so when you compile it in, the kernel aborts early with an error. Because your code has no error checking, you just don’t see the error.
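To make this kind of failure visible, here is a minimal sketch of error checking around a kernel launch. The `CUDA_CHECK` macro and `my_kernel` are hypothetical stand-ins, not names from your code. Only `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are actual runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): print the error
// string and exit if a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel; substitute the kernel that calls evaluate().
__global__ void my_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main()
{
    const int n = 1024;
    float *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    my_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_out));
    printf("kernel completed without error\n");
    return 0;
}
```

If the kernel aborts partway through, the `cudaDeviceSynchronize()` check should report it instead of letting the program carry on with a misleadingly short run time.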
Yes. If any thread performs an out-of-bounds shared memory access, the entire grid is aborted and the API will return an error. It is the only thing I can think of that would make the run time with the additional code shorter than the run time of the bare memory accesses.
Preventing it means fixing your code, and I can’t tell you how to do that. But detection can be done by running the code with cuda-memcheck (or compute-sanitizer, which replaces it in newer CUDA toolkits).
Passing local variables by pointer or reference to device functions should be fine. While a concurrent write or race might cause unpredictable results, it won’t necessarily cause execution problems. I say “necessarily” because there are still lots of ways a race on a pointer or index variable might lead to an out-of-bounds memory access. It is really impossible to say more without seeing the code.
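For illustration, a minimal sketch of the safe case, assuming a hypothetical device function `accumulate` and kernel `sum_squares` (neither is from your code). Each thread passes the address of its own local variable, so no other thread can touch it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device function: updates a caller-owned value through a pointer.
__device__ void accumulate(float *acc, float x)
{
    *acc += x * x;
}

// Hypothetical kernel: each thread owns its own `local`, so taking its
// address and passing it to a device function involves no sharing.
__global__ void sum_squares(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;          // bounds guard on the index

    float local = 0.0f;          // private to this thread
    accumulate(&local, in[i]);   // pointer to a thread-local variable
    out[i] = local;
}

int main()
{
    const int n = 256;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    sum_squares<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();     // wait before reading managed memory on the host

    printf("out[3] = %f\n", out[3]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The trouble starts when the pointer or the index is itself computed from data that several threads write concurrently, for example an index stored in shared memory. A race there can produce an address outside the allocation.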