Hi, I found that after each kernel's execution the GPU driver seems to flush the L1 and L2 caches, even though I set the ncu cache control to none and the replay mode to application.
Here is the test I ran.
I modified the vectorAdd sample from the CUDA SDK: I simply duplicated the vectorAdd kernel call and set the number of elements to 500, which is small enough to fit in cache. I set the ncu cache control to none and the replay mode to application.
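The modification described above can be sketched roughly as follows. This is a minimal sketch rather than my exact source: the kernel follows the standard vectorAdd sample, while the helper name `runVectorAddTwice`, the input values, and the block size of 256 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Element-wise add, in the style of the CUDA samples' vectorAdd kernel.
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

// Hypothetical helper: launches vectorAdd twice on the same buffers in the
// default stream. Returns the number of mismatching elements, or -1 on a CUDA error.
int runVectorAddTwice(int numElements) {
    size_t size = numElements * sizeof(float);
    std::vector<float> hA(numElements), hB(numElements), hC(numElements);
    for (int i = 0; i < numElements; ++i) {
        hA[i] = 1.0f * i;   // arbitrary test data
        hB[i] = 2.0f * i;
    }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, size) != cudaSuccess) return -1;
    if (cudaMalloc(&dB, size) != cudaSuccess) { cudaFree(dA); return -1; }
    if (cudaMalloc(&dC, size) != cudaSuccess) { cudaFree(dA); cudaFree(dB); return -1; }

    cudaMemcpy(dA, hA.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;  // assumed block size
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    // First launch: reads A and B from device memory.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);
    // Second launch: reads the same A and B again. This is the launch whose
    // L2 hit rate I look at in the ncu report.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hC.data(), dC, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return -1;

    // Verify the result on the host.
    int mismatches = 0;
    for (int i = 0; i < numElements; ++i) {
        if (std::fabs(hA[i] + hB[i] - hC[i]) > 1e-5f) ++mismatches;
    }
    return mismatches;
}

int main() {
    // 500 floats per array (about 6 KB for A, B and C together).
    int mismatches = runVectorAddTwice(500);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Both launches go into the same (default) stream and touch the same device buffers, with no host-to-device copy in between.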
In the ncu report, the L2 hit rate of the second vectorAdd launch is zero. That suggests arrays A and B are no longer resident in L2 by then.
From the PTX manual, I can understand that with the default .ca the L1 cache gets flushed/invalidated. But when I set the compilation flag to .cg, the result is the same: the L2 hit rate of the second vectorAdd is still zero.
For a single stream, I don't think this L2 flush policy makes sense. Is this the real driver policy, or is it some special cache control applied by ncu?