Hello! I’m still learning about CUDA and GPUs. I’ve been checking the forums, but it isn’t clear to me why I’m getting these cache hit percentages. In the following example code I read the same value from global memory in every thread and store it in shared memory, so the original value is never modified.
I’m using a GTX 980 Ti.
#include <stdio.h>

const int BLOCKS        = 100;
const int WARPS_BLOCK   = 16;
const int THREADS_WARP  = 32;
const int THREADS_BLOCK = WARPS_BLOCK * THREADS_WARP;

// Every thread reads the same global address and stores the value
// into shared memory, so global memory is never written.
__global__ void test(int* dummy) {
    __shared__ int dummy_shared[THREADS_BLOCK];
    dummy_shared[threadIdx.x] = dummy[0];
}

int main() {
    int* dummy;
    cudaMalloc((void**)&dummy, sizeof(int));
    cudaMemset(dummy, 0, sizeof(int));
    test<<<BLOCKS, THREADS_BLOCK>>>(dummy);
    cudaDeviceSynchronize();
    cudaFree(dummy);
    return 0;
}
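I build and run it with a plain nvcc invocation, roughly like this (the file and binary names are just placeholders):

```shell
# Compile with default settings and run once so nvprof has a kernel to profile
nvcc -o cache_test cache_test.cu
./cache_test
```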
nvprof shows that the L2 hit rate for reads is 100%, as I was expecting. Also, I’m not using local memory, so I expected the local hit rate to be 0%.
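For reference, the numbers below come from a metrics run along these lines (the metric names are from nvprof’s `--query-metrics` list; the binary name is a placeholder):

```shell
# Collect the L2, local, and unified L1/tex cache metrics for the kernel
nvprof --metrics l2_tex_read_hit_rate,l2_tex_write_hit_rate,\
local_hit_rate,global_hit_rate,tex_cache_hit_rate,\
tex_cache_transactions,tex_cache_throughput ./cache_test
```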
L2 Hit Rate (Texture Reads) 100.00% 100.00% 100.00%
L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
L2 Cache Utilization Low (1) Low (1) Low (1)
Local Hit Rate 0.00% 0.00% 0.00%
My issue is with the L1/tex cache. I was expecting a 100% hit rate there, but I can’t get past 75%. At first I was writing to and reading from the same global address, but I suspected that different SMs writing to that address would cause the L1 caches to mark the line as invalid/stale, so the L1 caches of the other SMs would have to fetch the data again. However, changing the writes to shared memory did not improve anything.
Using shared memory this way, maybe there is some contention between warps writing to it that slows things down, but as I understand it that is a separate problem and should not affect the cache hit rate.
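For completeness, the first version of the kernel (before I moved the writes to shared memory) was essentially this read-modify-write on a single global address (the kernel name here is just illustrative):

```cuda
// First attempt: every thread reads and writes the same global word.
// I suspected these stores were invalidating the line in the other
// SMs' L1 caches, which is why I switched to the shared-memory version.
__global__ void test_global_rmw(int* dummy) {
    dummy[0] = dummy[0] + 1;   // same address for all threads
}
```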
Global Hit Rate in unified l1/tex 75.00% 75.00% 75.00%
Unified Cache Throughput 39.662GB/s 39.662GB/s 39.662GB/s
Unified Cache Transactions 6400 6400 6400
Unified Cache Utilization Low (1) Low (1) Low (1)
Unified Cache Hit Rate 75.00% 75.00% 75.00%
I would also like to ask: what is the "unified cache" in this context? Unified memory, as I understand it, is when the host and the device share the same address space, but since I’m using cudaMalloc I’m not using that feature. Does "Unified Cache" here refer to unified memory, or is it something different?
I also checked the disassembled SASS code to confirm that the compiler isn’t changing the kernel’s behavior (e.g., optimizing the load or the shared-memory store away):
code for sm_30
Function : _Z8testPi
.headerflags @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
/* 0x2203f28042804307 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
/*0010*/ MOV R2, c[0x0][0x140]; /* 0x2800400500009de4 */
/*0018*/ MOV R3, c[0x0][0x144]; /* 0x280040051000dde4 */
/*0020*/ LD.E R2, [R2]; /* 0x8400000000209c85 */
/*0028*/ S2R R4, SR_TID.X; /* 0x2c00000084011c04 */
/*0030*/ SHL R4, R4, 0x2; /* 0x6000c00008411c03 */
/*0038*/ STS [R4], R2; /* 0xc900000000409c85 */
/* 0x20000000000002f7 */
/*0048*/ EXIT; /* 0x8000000000001de7 */
/*0050*/ BRA 0x50; /* 0x4003ffffe0001de7 */
/*0058*/ NOP; /* 0x4000000000001de4 */
/*0060*/ NOP; /* 0x4000000000001de4 */
/*0068*/ NOP; /* 0x4000000000001de4 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ NOP; /* 0x4000000000001de4 */
........................
In summary: why is the L1 hit rate not 100%? And what would a very simple example kernel that achieves a 100% L1 hit rate look like?
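For instance, I would naively expect something like the sketch below, where each block sweeps a small array (small enough to fit in L1) many times, to approach 100% after the cold first pass. Is that the right idea? (The kernel name, pass count, and indexing are just made up for illustration.)

```cuda
// Guess: all threads of a block sweep a small array that fits in L1.
// The first pass misses (cold), but later passes should hit in the
// unified L1/tex cache, pushing the hit rate toward 100%.
__global__ void test_sweeps(const int* data, int* out) {
    int sum = 0;
    for (int pass = 0; pass < 100; ++pass)
        sum += data[(threadIdx.x + pass) % blockDim.x];  // varying address,
                                                         // so loads aren't merged
    out[threadIdx.x] = sum;   // keep the loads from being optimized away
}
```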
Thank you for your time.