Cannot disable L1 cache on Fermi

Hello all,

I'm new to CUDA programming and am currently stuck on something that looks weird to me. My application consists of two CUDA files (.cu) and one C++ file (.cpp), and I want to compare its performance with the L1 cache enabled and disabled. The makefile is structured like those of the SDK benchmarks and also uses common/ from the SDK. To disable the L1 cache, I set the following flag in the makefile:

CUDACCFLAGS     := -Xptxas -dlcm=cg

However, even with this flag set, global memory accesses still go through L1: the l1_gld_hit and l1_gld_miss counters in the profiler do not change at all. I really have no idea why the flag doesn't work.
I also tried some benchmarks from the CUDA SDK, and for those applications the L1 cache can be disabled or enabled as I expect.
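
To make the setup concrete, here is a minimal sketch of how such a makefile might be laid out. It is not my real makefile: the executable name, the source file names, and the include path are placeholders, and I am assuming the usual SDK sample layout where common.mk reads EXECUTABLE, CUFILES and CCFILES. The two $(info ...) lines are only a GNU make debugging aid that prints the flag variables as they stand after the include, to see whether the setting survives.

```make
# Sketch only: all names and paths below are placeholders (assumptions).
EXECUTABLE  := myapp
CUFILES     := kernel_a.cu kernel_b.cu
CCFILES     := main.cpp

# The flag in question: ask ptxas to cache global loads in L2 only,
# bypassing L1 (cg = cache global; the default is ca = cache all).
CUDACCFLAGS := -Xptxas -dlcm=cg

# Assumes the project sits at the same depth as the SDK samples.
include ../../common/common.mk

# Debug aid (GNU make): show the variables after common.mk has been read.
# NVCCFLAGS is assumed to be the variable common.mk passes to nvcc.
$(info CUDACCFLAGS = $(CUDACCFLAGS))
$(info NVCCFLAGS   = $(NVCCFLAGS))
```

If the printed values do not contain -dlcm=cg, the flag is probably being lost somewhere between my makefile and the nvcc command line, not ignored by the compiler.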

The GPU I'm using is a GTX 580 and the CUDA toolkit version is 3.2.

Does anybody have a clue about this? Any clarification would be appreciated.