Your pitch calculations for indexing are not correct.
Please refer to the documentation:
CUDA Runtime API :: CUDA Toolkit Documentation
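The pitch value returned by cudaMallocPitch is a quantity in bytes, not in elements. To locate a row, add row * pitch to the base address as a byte (char *) pointer, and only then cast back to the element type and index the column.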
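For illustration, here is a minimal sketch of byte-based pitch indexing. It is not your code: the kernel name incrementPitched, the float element type, and the 100 x 64 array size are all assumptions made up for the example. The row-pointer arithmetic follows the pattern given in the cudaMallocPitch documentation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (name and operation are assumptions for illustration):
// adds 1.0f to every element of a pitched 2D float array.
__global__ void incrementPitched(float *data, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // pitch is in bytes: step to the row through a char pointer,
        // then index the column in units of float.
        float *rowPtr = (float *)((char *)data + row * pitch);
        rowPtr[col] += 1.0f;
    }
}

int main()
{
    const int width = 100, height = 64;  // illustrative sizes, in elements
    float *d_data = nullptr;
    size_t pitch = 0;

    // The requested width is given in bytes; the returned pitch is also in bytes.
    cudaError_t err = cudaMallocPitch((void **)&d_data, &pitch,
                                      width * sizeof(float), height);
    if (err != cudaSuccess) {
        printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("requested row width: %zu bytes, pitch: %zu bytes\n",
           width * sizeof(float), pitch);

    // Zero the valid region of each row (byte value 0 gives 0.0f).
    cudaMemset2D(d_data, pitch, 0, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    incrementPitched<<<grid, block>>>(d_data, pitch, width, height);

    // Copy back into a tightly packed host buffer (host pitch = width in bytes).
    std::vector<float> h_data(width * height);
    err = cudaMemcpy2D(h_data.data(), width * sizeof(float),
                       d_data, pitch,
                       width * sizeof(float), height,
                       cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("cudaMemcpy2D failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should now be exactly 1.0f if the indexing is right.
    int mismatches = 0;
    for (float v : h_data) {
        if (v != 1.0f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_data);
    return mismatches == 0 ? 0 : 1;
}
```

If you index with data[row * pitch + col] on a float pointer instead, the byte count gets scaled by sizeof(float) and the accesses land in the wrong place, possibly past the end of the allocation.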
Even when you fix that, the pitched method may not give any better performance than the unpitched method. Pitched allocations were especially useful on early GPUs, but are of less significance on modern GPUs. Depending on your GPU, it's possible that the overhead associated with pitch calculations in the kernel (especially for such a simple kernel) may outweigh any benefit from pitched access, although it should not cause a ~50x performance reduction.