Performance on TX1

I have used the cudaMallocManaged API to access unified memory. I have written a kernel with a single for loop inside it. However, I am not getting any speedup; instead, the run time increases considerably. Any suggestions?
Thanks in advance.

Are you trying to perform serial computations, i.e. one computation after another (as in a for loop)? The GPU is very fast at parallel computations, but not so fast at serial ones.
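
To illustrate the difference, here is a rough sketch, since I have not seen your actual kernel. The kernel names, the `run_scale` helper, the array size, and the "multiply every element by a factor" workload are all made up for the example. `scale_serial` keeps the whole for loop inside one GPU thread, while `scale_parallel` spreads the same iterations across many threads with a grid-stride loop. Both use memory from cudaMallocManaged.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Serial pattern: launched as <<<1, 1>>>, so a single GPU thread
// walks the whole array one element after another.
__global__ void scale_serial(float *data, int n, float factor) {
    for (int i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Parallel pattern: each thread starts at its global index and
// advances by the total number of threads (grid-stride loop).
__global__ void scale_parallel(float *data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical helper: allocates managed memory, fills it with 1.0f,
// runs one of the two kernels, and checks that every element equals factor.
bool run_scale(int n, float factor, bool parallel) {
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) {
        data[i] = 1.0f;
    }

    if (parallel) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        scale_parallel<<<blocks, threads>>>(data, n, factor);
    } else {
        scale_serial<<<1, 1>>>(data, n, factor);
    }

    // Wait for the kernel before the CPU touches managed memory again.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        ok = (data[i] == factor);
    }
    cudaFree(data);
    return ok;
}

int main() {
    const int n = 1 << 20;
    printf("serial:   %s\n", run_scale(n, 2.0f, false) ? "ok" : "failed");
    printf("parallel: %s\n", run_scale(n, 2.0f, true) ? "ok" : "failed");
    return 0;
}
```

If your kernel looks like the first version, restructuring it along the lines of the second is usually the first thing to try. Timing each launch (for example with cudaEvent timers) should show whether the loop itself is the bottleneck.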