CUDA L2 cache use

I use clock64() to measure how many cycles my kernel takes. My guess is that accesses to a global address are not cached, so every access goes all the way to global memory. Here is an example from my program.
Original version:
data = p_global_adress.a;

data = p_global_adress.a;

data = p_global_adress.f;

Then I changed it to this:
temp = p_global_adress.a;
data = temp;

data = temp;

data = temp;

I compared the cost of the two versions, and the second version is clearly cheaper than the first.
So I guess that in the first version each access goes to global memory rather than to the cache.

The PTX ISA document says:
PTX ISA version 2.0 introduced optional cache operators on load and store instructions.
The cache operators require a target architecture of sm_20 or higher. For sm_20 and
higher, the cache operators have the following definitions and behavior.

How can I use the L2 cache to improve the performance?

The L2 cache is always enabled; it cannot be disabled. The passage you quoted from the PTX ISA document is about L1 caching.
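
To illustrate what the cache operators control, here is a minimal sketch using the CUDA C++ load intrinsics __ldca and __ldcg, which correspond to the PTX ld.ca (cache at all levels) and ld.cg (cache at the global level, that is L2, bypassing L1) operators. The kernel name, the helper run_hint_demo, and the buffer names are made up for this example, and it assumes a toolkit and GPU recent enough to support these intrinsics. Both loads go through L2; only the use of L1 differs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads each element twice, once per cache hint.
__global__ void load_with_hints(const int* in, int* out_ca, int* out_cg, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_ca[i] = __ldca(&in[i]);  // ld.global.ca: cache at all levels (L1 and L2)
        out_cg[i] = __ldcg(&in[i]);  // ld.global.cg: cache in L2 only, bypass L1
    }
}

// Hypothetical helper: runs the kernel and checks that both hints return the same data.
bool run_hint_demo()
{
    const int n = 1 << 10;
    int *in = nullptr, *out_ca = nullptr, *out_cg = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out_ca, n * sizeof(int));
    cudaMallocManaged(&out_cg, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    load_with_hints<<<(n + 255) / 256, 256>>>(in, out_ca, out_cg, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // The hints change where data is cached, never the values that are loaded.
    for (int i = 0; ok && i < n; ++i)
        ok = (out_ca[i] == i) && (out_cg[i] == i);

    cudaFree(in);
    cudaFree(out_ca);
    cudaFree(out_cg);
    return ok;
}

int main()
{
    printf("results match: %s\n", run_hint_demo() ? "yes" : "no");
    return 0;
}
```

Note that neither hint offers a way to skip L2, which is consistent with the point above.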

If you believe you have some observation that suggests that L2 caching is not in effect, you are misinterpreting things.
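
One rough way to check this for yourself is to time a chain of dependent loads that all hit the same address, as sketched below. The kernel and helper names are made up for this example, and clock64 timing of this kind is only approximate. If every access really went to DRAM, the average would be on the order of the device's full global-memory latency; a much smaller average suggests the loads are being served from cache. Even a cache hit costs far more than reusing a value already held in a register, which may be why the temp version in the question looks faster.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: chain[0] holds 0, so every iteration loads the same
// address, and each load depends on the result of the previous one.
__global__ void chase_same_address(const int* chain, int iters,
                                   long long* cycles, int* sink)
{
    int idx = 0;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        idx = chain[idx];          // dependent load, cannot be folded away
    long long t1 = clock64();
    *cycles = t1 - t0;             // includes the first (cold) access
    *sink = idx;                   // keeps the loads from being optimized out
}

// Hypothetical helper: returns the average cycles per dependent load.
double average_cycles_per_load(int iters)
{
    int* chain = nullptr;
    int* sink = nullptr;
    long long* cycles = nullptr;
    cudaMalloc(&chain, sizeof(int));
    cudaMalloc(&sink, sizeof(int));
    cudaMalloc(&cycles, sizeof(long long));
    cudaMemset(chain, 0, sizeof(int));   // chain[0] = 0 points back at itself

    chase_same_address<<<1, 1>>>(chain, iters, cycles, sink);

    long long host_cycles = 0;
    cudaMemcpy(&host_cycles, cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(chain);
    cudaFree(sink);
    cudaFree(cycles);
    return static_cast<double>(host_cycles) / iters;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("L2 cache size: %d bytes\n", prop.l2CacheSize);
    printf("average cycles per dependent load: %.1f\n",
           average_cycles_per_load(4096));
    return 0;
}
```

On architectures that cache global loads in L1, these hits may come from L1 rather than L2, so the number is an upper bound on the observed latency, not a measurement of L2 alone.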