L2 flush policy across kernels

Hi, I found that after each kernel finishes, the GPU driver appears to flush the L1 and L2 caches, even though I set the ncu cache control to none and the replay mode to application.

The following is the test I did.

I modified the vectorAdd application from the CUDA SDK: I simply duplicated the vectorAdd kernel call and set the number of elements to 500, which is small enough to fit in cache. I set the ncu cache control to none and the replay mode to application.
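For reference, here is a minimal sketch of roughly what the modified test looks like. It is a reconstruction rather than my exact code: the CHECK macro, the host initialization values, and the block size of 256 are illustrative assumptions, and the ncu and nvcc command lines in the comments show how I believe the settings map to flags.

```cuda
// Sketch of the modified vectorAdd test: the same kernel is launched twice
// on the same small buffers in the default stream.
//
// Assumed build and profile commands (adjust to your setup):
//   nvcc -o vectorAdd vectorAdd.cu                   // default caching (.ca)
//   nvcc -Xptxas -dlcm=cg -o vectorAdd vectorAdd.cu  // cache globals in L2 only (.cg)
//   ncu --cache-control none --replay-mode application \
//       --metrics lts__t_sector_hit_rate.pct ./vectorAdd
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the original sample).
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Element-wise C = A + B.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 500;                       // 3 arrays * 500 floats = 6000 bytes
    const size_t bytes = n * sizeof(float);

    float hA[n], hB[n], hC[n];
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    CHECK(cudaMalloc(&dA, bytes));
    CHECK(cudaMalloc(&dB, bytes));
    CHECK(cudaMalloc(&dC, bytes));
    CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // First launch: brings A and B into the cache hierarchy.
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());

    // Second launch: same stream, same inputs. This is the kernel whose
    // L2 hit rate is being examined in the ncu report.
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());

    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));

    // Sanity check of the result.
    for (int i = 0; i < n; ++i) {
        if (hC[i] != 3.0f) { fprintf(stderr, "Mismatch at %d\n", i); return 1; }
    }
    printf("Test PASSED\n");

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFree(dC));
    return 0;
}
```

Both launches go to the default stream with no other work in between, so the only thing separating them is the kernel boundary itself.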

In the ncu report, the L2 hit rate of the second vectorAdd is zero. That suggests arrays A and B are no longer resident in L2.

From the PTX manual, I can understand that with the default .ca cache operator, L1 is flushed or invalidated. But when I set the compilation flag to cg, the result is the same: the L2 hit rate of the second vectorAdd is still zero.

For a single stream, this L2 flush policy doesn't make sense to me. Is this the real policy, or is it some special cache control in ncu?