[Edit] See third post below
I am trying to force global loads through the L2 cache on the Tegra K1 (Kepler). I have designed a very simple kernel where each thread block loads the same consecutive array of 8-bit unsigned integers using the following PTX instruction:
ld.global.cg.u8 %lda, [%src]
The .cg in the instruction should force the GPU to cache the global memory access in the L2 cache (as per this documentation: http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld).
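In CUDA C the instruction can be wrapped in a small device helper along these lines (just a sketch: the name ld_cg_u8 is mine, and the "r"/u32 addressing assumes the 32-bit address space of the Tegra K1 build):

__device__ __forceinline__ uint8_t ld_cg_u8(const uint8_t *p)
{
    uint16_t v;  /* there is no 8-bit asm constraint, so the byte comes back through a 16-bit register */
    asm volatile("{\n\t"
                 " .reg .u32 gaddr;\n\t"
                 " cvta.to.global.u32 gaddr, %1;\n\t"  /* generic -> global address */
                 " ld.global.cg.u8 %0, [gaddr];\n\t"   /* .cg: cache in L2 only */
                 "}"
                 : "=h"(v)
                 : "r"(p));
    return (uint8_t)v;
}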
To prevent the compiler from optimising away my global load (something I struggled with for a long time), I put this at the end:
if (!src) {
    src[index] = xxxx;
}
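The idea is that the stored value depends on the loaded value, and nvcc cannot prove at compile time that the branch is never taken, so it has to keep the load. A stand-alone sketch of the same trick (kernel name and indexing are illustrative, not my actual code):

__global__ void keep_load_alive(uint8_t *src)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    uint8_t v = src[index];   /* the load I do not want optimised away */
    if (!src) {               /* never true at runtime, since src is never NULL, */
        src[index] = v;       /* but the potential store keeps v (and the load) alive */
    }
}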
When I run this in nvprof, I get the same global load throughput as L2 read throughput:
gld_throughput Global Load Throughput 2.1534GB/s 2.1534GB/s 2.1534GB/s
l2_read_throughput L2 Throughput (Reads) 2.2487GB/s 2.2487GB/s 2.2487GB/s
…indicating that nothing is actually being kept in the L2 cache, even though 1M/32 thread blocks all access the same consecutive array of 32 8-bit unsigned integers.
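(For reference, the two numbers above come from an nvprof metrics run along these lines, with ./my_app standing in for the actual binary:

nvprof --metrics gld_throughput,l2_read_throughput ./my_app
)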
What am I missing here? Why is my data not getting cached? Is there any way to force nvcc to accept my code, even though it thinks it is “stupid”, pointless and unoptimised? :P
/tearing my hair out after > 16 hours of PTX assembly coding
Also adding an example kernel and the nvprof output. The kernel is meant to resemble a very simple operation that adds two numbers, where
uint8_t *matrix
should be cached, since all thread blocks access this structure!
/* Tests L2-cached memory loads */
__global__ void memcpy_l2_cached_read(uint8_t *buf, uint8_t *matrix){
    uint8_t var;
    uint16_t var2;
    int ind;
    ind = threadIdx.x;
    uint8_t *src, *lsrc;
    src = &matrix[ind];                          /* same 32 bytes for every block */
    ind = threadIdx.x + blockDim.x * blockIdx.x;
    lsrc = &buf[ind];                            /* one distinct byte per thread */
    /* PTX register declarations */
    asm(
        ".reg .s32 %src;\n\t"
        ".reg .s32 %lsrc;\n\t"
        ".reg .u8 %lda;\n\t"
        ".reg .u8 %ldb;\n\t"
        ".reg .u16 %ldc;\n\t"
        ".reg .u16 %ldd;\n\t"
        ".reg .u16 %ldr;\n\t"
        ".reg .u8 %lds;\n\t"
    );
    /* Move the pointer arguments into PTX registers and convert them to global addresses */
    asm(
        "mov.u32 %src, %0;\n\t"
        "cvta.to.global.u32 %src, %src;\n\t"
        "mov.u32 %lsrc, %1;\n\t"
        "cvta.to.global.u32 %lsrc, %lsrc;\n\t"
        :: "r"(src), "r"(lsrc)
    );
    /* The actual loads: matrix through L2 (.cg), buf through the default path, then add */
    asm(
        "ld.global.cg.u8 %lda, [%src];\n\t"
        "ld.global.u8 %ldb, [%lsrc];\n\t"
        "cvt.u16.u8 %ldc, %lda;\n\t"
        "cvt.u16.u8 %ldd, %ldb;\n\t"
        "add.u16 %ldr, %ldd, %ldc;\n\t"
        "mov.u16 %0, %ldr;\n\t"
        : "=h"(var2)
    );
    var = (uint8_t) var2;
    if(!buf){                                    /* never true at runtime; keeps the loads alive */
        buf[ind] = var * *src;
    }
}
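For completeness, a host-side sketch of how the kernel is launched (illustrative sizes following the 1M-threads-in-blocks-of-32 setup described above, no error checking):

/* Host-side sketch: every block reads the same 32-byte matrix,
   and buf holds one byte per thread. */
int main(void)
{
    const int threads = 32;
    const int blocks  = (1 << 20) / threads;              /* ~1M threads in total */
    uint8_t *buf, *matrix;

    cudaMalloc((void **)&buf, (size_t)blocks * threads);  /* one byte per thread */
    cudaMalloc((void **)&matrix, threads);                /* the shared 32-byte array */
    cudaMemset(matrix, 1, threads);

    memcpy_l2_cached_read<<<blocks, threads>>>(buf, matrix);
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(matrix);
    return 0;
}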
nvprof output:
==1672== Event result:
Invocations Event Name Min Max Avg
Device "GK20A (0)"
Kernel: memcpy_l2_cached_read(unsigned char*, unsigned char*)
1 l2_subp0_total_read_sector_queries 978969 978969 978969
1 l2_subp0_total_write_sector_queries 312521 312521 312521
1 elapsed_cycles_sm 11829920 11829920 11829920
1 uncached_global_load_transaction 937500 937500 937500
==1672== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GK20A (0)"
Kernel: memcpy_l2_cached_read(unsigned char*, unsigned char*)
1 l1_cache_global_hit_rate L1 Global Hit Rate 0.00% 0.00% 0.00%
1 l1_cache_local_hit_rate L1 Local Hit Rate 0.00% 0.00% 0.00%
1 ipc Executed IPC 0.368545 0.368545 0.368545
1 gld_requested_throughput Requested Global Load Throughput 2.1534GB/s 2.1534GB/s 2.1534GB/s
1 gst_throughput Global Store Throughput 717.81MB/s 717.81MB/s 717.81MB/s
1 gld_throughput Global Load Throughput 2.1534GB/s 2.1534GB/s 2.1534GB/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 issued_ipc Issued IPC 0.477173 0.477173 0.477173
1 gld_transactions Global Load Transactions 937500 937500 937500
1 gst_transactions Global Store Transactions 312500 312500 312500
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 inst_executed Instructions Executed 4375000 4375000 4375000
1 nc_l2_read_throughput L2 Throughput (Non-Coherent Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 nc_l2_read_transactions L2 Non-Coherent Read Transactions 0 0 0
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 50000000 50000000 50000000
1 inst_bit_convert Bit-Convert Instructions 10000000 10000000 10000000
1 inst_control Control-Flow Instructions 10000000 10000000 10000000
1 inst_compute_ld_st Load/Store Instructions 40000000 40000000 40000000
1 l2_read_transactions L2 Read Transactions 978969 978969 978969
1 l2_write_transactions L2 Write Transactions 312521 312521 312521
1 l2_read_throughput L2 Throughput (Reads) 2.2487GB/s 2.2487GB/s 2.2487GB/s
1 l2_write_throughput L2 Throughput (Writes) 717.85MB/s 717.85MB/s 717.85MB/s
I found out that the counters listed above (the gld_throughput and l2_read_throughput metrics) do not actually represent the DRAM <-> L2 bandwidth. According to this post: http://stackoverflow.com/questions/11926222/how-to-profile-the-number-of-global-memory-transactions-without-using-uncached-g I need the event counter
fb_read_transactions
for this purpose. However, the Tegra K1 does not expose this event counter. Is there any other way to measure the DRAM read/write bandwidth, other than through event counters?
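For reference, nvprof itself can list what a device exposes, so the check looks something like this (./my_app is a placeholder for the actual binary):

nvprof --query-events                             # list the hardware event counters the device exposes
nvprof --query-metrics                            # list the derived metrics
nvprof --events fb_read_transactions ./my_app     # collect the counter, where it exists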