Is L1 caching of global memory possible on GK110?

In a benchmarking kernel that repeatedly fetches the same data from global memory, there is some evidence that these fetches are being cached in L1, even though:

  1. I haven’t declared the pointed-to data as “const”.
  2. The experiment was run on a Tesla K20c, which is based on GK110 and, according to the documentation, does not support L1 caching of global memory data.

The memory fetches are implemented with inline PTX instructions like the following, in order to request global memory caching where available:

asm volatile ("ld.ca.u32 %0, [%1];" : "=r"(retval) : "l"(p));

The main loop, as compiled for the CC 3.5 ISA, does contain instructions that fetch the same data again and again, performing one bitwise XOR with a zero value after each fetch (this is done to prevent the compiler from eliminating any of the loads). Part of the main loop looks like this (from cuobjdump):

...
        /*00c8*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*00d0*/                   ISETP.LT.AND P0, PT, R6, 0x2000, PT;          /* 0xb3181c10001c181d */
        /*00d8*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*00e0*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*00e8*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
        /*00f0*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*00f8*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
                                                                                 /* 0x08dca0dca0dca0dc */
        /*0108*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*0110*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
        /*0118*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*0120*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*0128*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*0130*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
        /*0138*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
                                                                                 /* 0x08a0dca0dca0dca0 */
        /*0148*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*0150*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*0158*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
        /*0160*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*0168*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*0170*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*0178*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
                                                                                 /* 0x08dca0dca0dca0dc */
        /*0188*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*0190*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*0198*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
        /*01a0*/                   LOP.XOR R10, R4, R16;                         /* 0xe2002000081c102a */
        /*01a8*/                   LD.E R4, [R10];                               /* 0xc4800000001c2810 */
        /*01b0*/                   LOP.XOR R16, R4, R10;                         /* 0xe2002000051c1042 */
        /*01b8*/                   LD.E R4, [R16];                               /* 0xc4800000001c4010 */
...
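For reference, a kernel along the following lines would be expected to compile to a dependent LD / LOP.XOR chain like the one above. This is only a minimal sketch, not the actual benchmark: the names `load_ca` and `repeated_fetch`, the launch configuration and the iteration count are all hypothetical, and the buffer is assumed to be zero-filled so that the XOR leaves the address unchanged.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper wrapping the inline PTX quoted earlier in the post.
__device__ __forceinline__ uint32_t load_ca(const uint32_t* p)
{
    uint32_t retval;
    asm volatile ("ld.ca.u32 %0, [%1];" : "=r"(retval) : "l"(p));
    return retval;
}

// Each thread re-reads its own element. The loaded value (assumed to be 0)
// is XORed into the address, so every load depends on the previous one and
// the compiler cannot remove or hoist the loads.
__global__ void repeated_fetch(const uint32_t* data, uintptr_t* sink, int iters)
{
    uintptr_t addr = reinterpret_cast<uintptr_t>(data + threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        uint32_t v = load_ca(reinterpret_cast<const uint32_t*>(addr));
        addr ^= v;  // v == 0, so the address stays the same
    }
    // Write the final address so the whole chain stays live.
    sink[blockIdx.x * blockDim.x + threadIdx.x] = addr;
}

int main()
{
    // Assumed values, chosen only for illustration.
    const int threads = 256, blocks = 1, iters = 8192;

    uint32_t*  data = nullptr;
    uintptr_t* sink = nullptr;
    cudaMalloc(&data, threads * sizeof(uint32_t));
    cudaMalloc(&sink, threads * blocks * sizeof(uintptr_t));
    cudaMemset(data, 0, threads * sizeof(uint32_t));  // zero-filled buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    repeated_fetch<<<blocks, threads>>>(data, sink, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d dependent loads per thread in %.3f ms\n", iters, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    cudaFree(sink);
    return 0;
}
```

Building it for the K20c with something like `nvcc -arch=sm_35` and inspecting the result with `cuobjdump -sass` should make it possible to compare the generated loop with the dump above.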

Here are some profiling metrics captured while running on the K20c that support my suspicion about L1 caching:

l1_cache_global_hit_rate: 99.94% (Shouldn’t this be 0% on GK110?)
tex_cache_hit_rate: 0.00% (So no read only cache is being utilized which maps to texture cache)
l1_shared_utilization: Mid (5) (kernel does not use shared memory so all utilization is due just to L1)
l2_utilization: Low (1) (L2 cache does not seem to contribute to kernel’s performance)
ldst_fu_utilization: Max (10) (fully utilized load/store units…)
tex_utilization: Idle (0) (…but idle texture units)

This happens on a K20c (GK110), which afaik is not based on GK110B and thus should not support L1 caching of global memory.

In contrast, when the same experiment was run on a GTX 660 (CC 3.0), the data fetches proved to be completely uncached in L1, as expected.
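One way to cross-check what the runtime itself reports for each card is to query the device properties. The sketch below assumes a CUDA toolkit recent enough to expose the `globalL1CacheSupported` and `localL1CacheSupported` fields of `cudaDeviceProp`; I have not verified what it prints on either of the two cards mentioned here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        fprintf(stderr, "could not enumerate CUDA devices\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            continue;  // skip devices that cannot be queried
        }
        // Non-zero means the runtime reports that L1 caching is supported
        // for global (resp. local) memory loads on this device.
        printf("device %d: %s (CC %d.%d) globalL1CacheSupported=%d "
               "localL1CacheSupported=%d\n",
               d, prop.name, prop.major, prop.minor,
               prop.globalL1CacheSupported, prop.localL1CacheSupported);
    }
    return 0;
}
```

If the K20c reports support here while the GTX 660 does not, that would at least be consistent with the profiler numbers above.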