hi, I’m just learning the cuda. I have create a test case for testing L1 and L2 cache behavior and encounter some issue.
I try to read 64 bytes from global memory. Firstly it go to L1 cache and miss all two sectors, then it go to L2 cache for these 2 sectors, again it got cache miss, so it finally go to the dram. And then I found that it get 4 sectors from dram, which I assume it should be 2 sector as the granularity of L2 cache is 32-byte.
I use the RTX4000 to do the test.
the kernel is very simple
static __global__ void sumArraysGPU(short* a, short* b, short* res, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (k < n) {
res[i] = a[i];
}
}
I would not test those numbers with such low occupancy: There could be memory accesses like loading parameters or constants, which are just a few bytes overall, but distort your calculation. I am not sure, if this is an issue here, but for example the shared memory bank conflicts are known to have a slight variability for small numbers.
Use a loop and load from such memory addresses, that the data of each iteration is apart (not consecutive) in memory so that possible prefetching can still be seen as effect.
The tools and GPU HWPM system are not designed to be as accurate as you are requesting. Reasons you can run into issues.
A metric may be collected over multiple passes. Given no method for 100% deterministic replay this can result in variance/error.
GPU has many independent simultaneous engines that may increment a PM.
On more recent GPUs tools have moved to _realtime metrics. These are not 100% accurate. The error is small for typical sample periods (e.g. ±32 for 10000 cycles) but on a small sample you may not see an increment.
There are hardware features on hardware features. On 100 class HBM GPUs and Ada GPUs L2 has 64B promotion enabled by default. 100 class may also have ECC turned on which can increase traffic.
When testing PMs I generally launch equal warps per SM sub-partition (optimal launch) and produce sufficient work that any other increment reason will result in small variance that will be noise.