Hello,
I am trying to understand the behaviour of the L1 and L2 caches on the Jetson Nano (Maxwell architecture).
According to the Maxwell documentation:
Kernel memory requests are typically served between the device DRAM and SM on-chip memory using either 128-byte or 32-byte memory transactions.
If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction.
To check this, I compiled my program once with -Xptxas -dlcm=ca (cache in both L1 and L2) and once with -Xptxas -dlcm=cg (cache in L2 only), using the kernel below.
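For reference, the two builds only differ by that flag on the nvcc command line; something along these lines (the source file name is a placeholder, and I am assuming sm_53 as the Jetson Nano / Tegra X1 architecture):

# L1 + L2 caching for global loads
nvcc -arch=sm_53 -Xptxas -dlcm=ca main.cu -o test_ca
# L2-only caching for global loads
nvcc -arch=sm_53 -Xptxas -dlcm=cg main.cu -o test_cg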
#include <cstdio>

// Each thread reads one float from global memory (shifted by 'offset')
// and stores it into shared memory.
__global__ void readOffset(float* A, const int n, int offset)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int k = i + offset;
    __shared__ float saveVal[512];
    //if (k < n)
    saveVal[threadIdx.x] = A[k]; // +B[k];
}
void L1_L2Cache(int nOffset)
{
    int N = 1024 * 32;                       // 32768 floats
    float *A;
    cudaMallocManaged((void**)&A, N * sizeof(float));

    dim3 block(512);
    dim3 grid((N + block.x - 1) / block.x);  // 64 blocks of 512 threads

    float time;
    cudaEvent_t tstart, tstop;
    cudaEventCreate(&tstart);
    cudaEventCreate(&tstop);

    cudaEventRecord(tstart, 0);
    readOffset<<<grid, block>>>(A, N, nOffset);
    cudaDeviceSynchronize();
    cudaEventRecord(tstop, 0);
    cudaEventSynchronize(tstop);
    cudaEventElapsedTime(&time, tstart, tstop);

    printf("L1/L2 cache time: %3.5f ms\n", time);
    cudaFree(A);
}
int main()
{
    L1_L2Cache(0);
    return 0;
}
Running the program under nvprof, I get the following metrics, which are identical for both compilation options:
Device "NVIDIA Tegra X1 (0)"
Kernel: readOffset(float*, int, int)
1 l2_global_load_bytes Bytes read from L2 for misses in Unified Cache for global loads 131072 131072 131072
1 tex_cache_transactions Unified Cache Transactions 4096 4096 4096
1 global_load_requests Total number of global load requests from Multiprocessor 4096 4096 4096
1 gld_transactions_per_request Global Load Transactions Per Request 16.001953 16.001953 16.001953
1 gld_transactions Global Load Transactions 16386 16386 16386
1 ldst_executed Executed Load/Store Instructions 4096 4096 4096
1 l2_read_transactions L2 Read Transactions 4196 4196 4196
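These were collected with a command along these lines (the binary name is just a placeholder for my executable):

nvprof --metrics l2_global_load_bytes,tex_cache_transactions,global_load_requests,gld_transactions_per_request,gld_transactions,ldst_executed,l2_read_transactions ./test_ca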
To my understanding:
Reading the 1024x32 array elements takes 1024 warp-level load instructions (= requests), one per warp of 32 threads. Each thread reads 4 bytes, so each warp reads 32 x 4 = 128 bytes. With L1 enabled that should mean one 128-byte memory transaction per request; with L1 disabled it should mean four 32-byte transactions per request, i.e. 1024 x 4 = 4096, which is what I seem to see in the Executed Load/Store Instructions count.
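Spelling out my arithmetic explicitly (this just restates my expectations above against the profiler numbers, it is not an explanation):

// host-side arithmetic only
int N            = 1024 * 32;        // 32768 elements read by 32768 threads
int warps        = N / 32;           // 1024 warps -> 1024 warp-level load requests (expected)
int bytesPerWarp = 32 * 4;           // 128 bytes per warp
int totalBytes   = N * 4;            // 131072 bytes -> matches l2_global_load_bytes
int sectors32B   = totalBytes / 32;  // 4096 32-byte transactions -> close to l2_read_transactions (4196)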
Can someone please explain these numbers? Thank you.