How to optimize code according to profile summary?

My code contains several kernels, and I profiled it on my GTX285 card.
The profile summary is shown below. The code uses double precision.

According to the summary, how should I optimize my code?
I tried changing the data structures and the algorithm so that global memory accesses are coalesced,
but the code now runs even slower!

The profile summary also includes the time for a single call of each kernel.

profile summary