Optimizing a CUDA program

My program solves the same problem with several different algorithms. The memory accesses are the same across all of them, but the amount of computation differs.
Measured against the peak compute throughput, the algorithm that does less computation turns out to be slower than the other one.
Can somebody tell me how to find the bottleneck?
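
A minimal sketch of one way to frame the comparison, assuming the floating-point operation count, the bytes moved, and the elapsed time of each kernel are known (for example, time taken from `cudaEventElapsedTime`, converted to seconds). The names `KernelRates`, `achieved_rates`, and `likely_bound`, and the 0.6 threshold, are hypothetical and only for illustration; the peak figures have to come from the specification of the actual GPU.

```cpp
#include <cassert>
#include <string>

// Achieved rates for one kernel run, derived from counts supplied by the caller.
struct KernelRates {
    double gflops;  // achieved arithmetic throughput in GFLOP/s
    double gbytes;  // achieved memory throughput in GB/s
};

// flops: floating-point operations executed by the kernel (counted by hand or by a profiler)
// bytes: bytes read from plus written to global memory
// seconds: measured kernel time
KernelRates achieved_rates(double flops, double bytes, double seconds) {
    assert(seconds > 0.0);
    return { flops / seconds / 1e9, bytes / seconds / 1e9 };
}

// Rough classification against the device peaks.
// threshold is an assumed cut-off (fraction of peak), not a standard value.
std::string likely_bound(const KernelRates& r,
                         double peak_gflops,
                         double peak_gbytes,
                         double threshold = 0.6) {
    const double compute_frac = r.gflops / peak_gflops;
    const double memory_frac  = r.gbytes / peak_gbytes;
    if (memory_frac >= threshold && memory_frac >= compute_frac) return "memory";
    if (compute_frac >= threshold) return "compute";
    // Far from both peaks: the limit is probably elsewhere
    // (latency, occupancy, uncoalesced access, divergence, ...).
    return "neither";
}
```

Running this for each algorithm variant shows whether either one gets close to the memory or compute peak. If both land in "neither", the slower variant is probably limited by something other than raw arithmetic, and a profiler is the next step.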