Why is using [&] 10% slower than [=] when wrapping a kernel function call in a lambda expression?

I need to elaborate on my original question, in which I wanted to find out whether 2D array indexing seriously impacts performance. However, I failed to allocate a 2D array using cudaMalloc, as described in CUDA_2d_cudaMalloc.

After reading some forum posts, I finally came to the conclusion that a buffer allocated with a single cudaMalloc call cannot be indexed directly in C++ style, like "d_array[i][j]".
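For anyone landing here with the same problem, the usual workaround is to keep one contiguous buffer and compute the row-major offset by hand. Below is a minimal host-side C++ sketch of that arithmetic. The names `flat_index` and `make_matrix` are hypothetical and only for illustration. I am assuming the same expression `i * cols + j` would be applied to the pointer returned by cudaMalloc inside a kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major offset of element (i, j) in a flat buffer
// that stores a matrix with `cols` columns.
inline std::size_t flat_index(std::size_t i, std::size_t j, std::size_t cols) {
    assert(j < cols);  // column index must stay inside the row
    return i * cols + j;
}

// Hypothetical helper: builds a rows x cols matrix in one contiguous buffer,
// where element (i, j) holds the value 10 * i + j.
inline std::vector<int> make_matrix(std::size_t rows, std::size_t cols) {
    std::vector<int> data(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Stands in for the unsupported d_array[i][j] syntax.
            data[flat_index(i, j, cols)] = static_cast<int>(10 * i + j);
        }
    }
    return data;
}
```

With this layout, `data[flat_index(i, j, cols)]` plays the role of `d_array[i][j]`, and only one allocation and one copy are needed for the whole matrix.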

I am still really grateful for @Robert_Crovella's guidance, which inspired me a lot.