Does a normal for loop in a kernel take more time than in a host function?

I am using kernels to process several char arrays simultaneously. In every thread I use a for loop to check whether the current char array matches the target char array, comparing each element with the corresponding element of the target.
It takes much more time than the same check does on the CPU.
Can anyone help me understand the issue here?
Any suggestions for improvement would be appreciated.

Can you post some code?
Also make sure you understand memory coalescing:
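
Without seeing your code I can only guess, but here is a minimal sketch of what a coalesced version of this comparison might look like. Everything in it is an assumption on my part: the names `matchKernel` and `findMatches`, the fixed length `LEN`, and the interleaved layout are all made up for illustration. The idea is that if thread `i` reads `data[i * LEN + j]`, neighbouring threads touch addresses `LEN` bytes apart, whereas with `data[j * n + i]` neighbouring threads read neighbouring bytes at each loop step.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int LEN = 16;  // assumption: every char array has this fixed length

// One thread per candidate. The candidates are stored interleaved:
// character j of candidate i is at data[j * n + i], so at each loop step
// consecutive threads read consecutive bytes.
__global__ void matchKernel(const char* data, const char* target, int n, int* match)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int equal = 1;
    for (int j = 0; j < LEN; ++j) {
        // All threads read the same target[j]; only the data index varies per thread.
        if (data[(size_t)j * n + i] != target[j]) { equal = 0; break; }
    }
    match[i] = equal;
}

// Hypothetical host helper. `rows` is row-major (candidate i occupies
// rows[i * LEN .. i * LEN + LEN - 1]) and `target` points to LEN chars.
// Returns one 0/1 flag per candidate, or an empty vector if the final
// copy back reports a CUDA error. Allocation checks are omitted for brevity.
std::vector<int> findMatches(const std::vector<char>& rows, const char* target, int n)
{
    assert(n >= 0 && rows.size() == (size_t)n * LEN);
    std::vector<int> result(n, 0);
    if (n == 0) return result;

    // Transpose on the host into the interleaved layout the kernel expects.
    std::vector<char> interleaved(rows.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < LEN; ++j)
            interleaved[(size_t)j * n + i] = rows[(size_t)i * LEN + j];

    char* dData = nullptr;
    char* dTarget = nullptr;
    int* dMatch = nullptr;
    cudaMalloc(&dData, interleaved.size());
    cudaMalloc(&dTarget, LEN);
    cudaMalloc(&dMatch, n * sizeof(int));
    cudaMemcpy(dData, interleaved.data(), interleaved.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dTarget, target, LEN, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    matchKernel<<<blocks, threads>>>(dData, dTarget, n, dMatch);

    // This blocking copy waits for the kernel and surfaces any earlier error.
    cudaError_t err = cudaMemcpy(result.data(), dMatch, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTarget);
    cudaFree(dMatch);
    if (err != cudaSuccess) result.clear();
    return result;
}
```

You would call `findMatches(rows, target, n)` from the host with your `n` arrays packed back to back. When you compare against the CPU, time the kernel separately from the `cudaMalloc` and `cudaMemcpy` calls, because for a small number of short arrays the transfers alone can easily outweigh the comparison itself.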