Unexpected Data Read Behavior on Tesla V100: Cache Line and Memory Access Patterns

To observe how the GPU actually reads data from DRAM, I wrote a small CUDA program. It reads a 1x1 matrix, multiplies its single element by 2, and writes the result back to the matrix. In theory, L2 should read 4B from DRAM, but in practice it reads 224B. Similarly, L1 should read 4B from L2, but it reads 32B (see PIC-1).

(Note: The device I am using is the Tesla V100.)

I attempted to find the reasons for this behavior:

  1. L2 reads from DRAM at a granularity of one 32B sector, so the smallest possible read is 32B. For a 1x1 matrix I would therefore expect one sector (32B), not 224B (7 sectors). The 32B that L1 reads from L2 matches this sector size.
  2. To further investigate, I tried a 1x2 matrix and obtained the same result as the 1x1 matrix. The data read by L2 from DRAM remained 224B (7 sectors). The data read by L1 from L2 was 32B, which at least confirms the 32B sector restriction.
  3. For the third attempt, I used a 1x10 matrix. Here, L2 still reads 224B (7 sectors) from DRAM, but L1 reads 64B from L2. This is because 10 x 4B = 40B exceeds 32B, so two sectors are needed (see PIC-2).
  4. In the fourth attempt, with a 1x20 matrix, L2 read 288B (9 sectors) from DRAM, and L1 read 96B from L2. This is because 20 x 4B = 80B exceeds 64B, so three sectors are needed (see PIC-3).
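
The L1-from-L2 numbers in the four attempts above follow from rounding the data size up to whole 32B sectors. A minimal sketch of that arithmetic in C++, assuming 4B float elements and a buffer that starts on a sector boundary (`min_sectors` is a hypothetical helper name, not part of any API):

```cpp
#include <cstddef>

// Sector size quoted in the post (32 B); each matrix element is assumed
// to be a 4 B float.
constexpr std::size_t kSectorBytes = 32;
constexpr std::size_t kElemBytes = 4;

// Minimum number of 32 B sectors needed to cover n_elems contiguous
// elements, assuming the buffer starts on a sector boundary.
constexpr std::size_t min_sectors(std::size_t n_elems) {
    return (n_elems * kElemBytes + kSectorBytes - 1) / kSectorBytes;
}

// L1-from-L2 traffic reported in the post: 32 B, 32 B, 64 B, 96 B.
static_assert(min_sectors(1) * kSectorBytes == 32, "1x1 matrix: 1 sector");
static_assert(min_sectors(2) * kSectorBytes == 32, "1x2 matrix: 1 sector");
static_assert(min_sectors(10) * kSectorBytes == 64, "1x10 matrix: 2 sectors");
static_assert(min_sectors(20) * kSectorBytes == 96, "1x20 matrix: 3 sectors");
```

This reproduces the L1-from-L2 figures only. The DRAM-side counts (7 and 9 sectors) are larger than this lower bound and are not explained by it.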

My questions are:

  1. Why is the data read by L2 from DRAM 224B (7 sectors) for 1x1 and 1x2 matrices, when theoretically it should be 32B (1 sector)? Also, why does it become 2 sectors for a 1x10 matrix?
  2. In further experiments, L2 read 11 sectors from DRAM for a 1x33 matrix and 13 sectors for a 1x49 matrix. Can I conclude that this code has a base overhead of 5 sectors between L2 and DRAM? Beyond that base, L2 seems to read data in increments of 2 sectors: each time the data size crosses another 64B boundary, the sector count increases by 2.

Where does this base overhead of 5 sectors come from in this code?
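
The pattern described in question 2 can be written down as a small model and checked against the six measurements quoted in this post. The sketch below is only a curve fit under stated assumptions (4B elements, a fixed 5-sector baseline, 2 sectors per started 64B chunk). The names `modeled_dram_sectors`, `kObserved`, and `model_matches_all` are hypothetical, and nothing here is a documented hardware rule:

```cpp
#include <cstddef>

// Hypothetical model fitted to the measurements in this post:
// a fixed 5-sector baseline plus 2 sectors per started 64 B chunk of data.
constexpr std::size_t kElemBytes = 4;     // assumed float elements
constexpr std::size_t kChunkBytes = 64;   // step size suggested by the data
constexpr std::size_t kBaseSectors = 5;   // the "base overhead" being asked about

constexpr std::size_t modeled_dram_sectors(std::size_t n_elems) {
    return kBaseSectors +
           2 * ((n_elems * kElemBytes + kChunkBytes - 1) / kChunkBytes);
}

// Matrix width (1 x n_elems) and the DRAM-to-L2 sector count reported for it.
struct Observation {
    std::size_t n_elems;
    std::size_t dram_sectors;
};

constexpr Observation kObserved[] = {
    {1, 7}, {2, 7}, {10, 7}, {20, 9}, {33, 11}, {49, 13},
};

// Returns true if the model reproduces every measurement listed above.
inline bool model_matches_all() {
    for (const Observation& o : kObserved) {
        if (modeled_dram_sectors(o.n_elems) != o.dram_sectors) {
            return false;
        }
    }
    return true;
}
```

Agreement with six data points shows only that the description is self-consistent. It does not identify where the extra traffic comes from, and other traffic sources could produce the same counts.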

The GPU memory subsystem operates at sector granularity (32B), so you won’t see data move at any smaller granularity than that. In addition, kernel code, kernel parameters, and so on also have to come from device memory, so you can see additional traffic there. If the GPU is also used for display, DRAM read traffic can be higher still. Any of these could cause more data to be read from device memory, so I don’t think you can draw a conclusion about a “5 sector overhead”.