Pascal L1 cache

I can confirm your observation with a quick test: an L2 load miss to 1 sector (in a 4 x 32 byte sector cache line) to data in device memory causes a 2 sector (64 byte) request to the HBM2 memory controller. In the test, the data fetched was either the lower two 32 byte sectors or the upper two 32 byte sectors of the line. I have not run this test on other GPUs where the memory type used may also benefit from 64 byte requests (e.g. GDDR5X on GP10x).
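
To make the sector pairing concrete, here is a minimal sketch of the address arithmetic implied by that observation. It is only a model of the behavior seen in the test, not a description of the hardware implementation. The names (`MissFootprint`, `footprint_for_miss`) are made up for illustration, and it assumes the 64 byte request is naturally aligned, which matches the "lower two or upper two sectors" result.

```cpp
#include <cassert>
#include <cstdint>

// Geometry described in the post: 128 byte lines made of 4 x 32 byte sectors.
constexpr std::uint64_t kSectorBytes = 32;
constexpr std::uint64_t kLineBytes = 128;
// Observed request size on an L2 miss to device memory backed by HBM2.
// Assumption: the request is aligned to its own size (64 bytes).
constexpr std::uint64_t kDramRequestBytes = 64;

// Hypothetical helper type, not part of any NVIDIA API.
struct MissFootprint {
    std::uint64_t line_base;        // base address of the 128 byte line
    unsigned missed_sector;         // sector that missed, 0..3
    unsigned first_fetched_sector;  // 0 (lower pair) or 2 (upper pair)
    std::uint64_t fetch_base;       // base address of the 64 byte request
};

// Given the address of a load that misses in L2, compute which pair of
// sectors the model says would be requested from the memory controller.
inline MissFootprint footprint_for_miss(std::uint64_t addr) {
    MissFootprint f{};
    f.line_base = addr & ~(kLineBytes - 1);
    f.missed_sector =
        static_cast<unsigned>((addr & (kLineBytes - 1)) / kSectorBytes);
    f.fetch_base = addr & ~(kDramRequestBytes - 1);
    f.first_fetched_sector =
        static_cast<unsigned>((f.fetch_base - f.line_base) / kSectorBytes);
    // The missed sector is always one of the two sectors in the request.
    assert(f.missed_sector == f.first_fetched_sector ||
           f.missed_sector == f.first_fetched_sector + 1);
    return f;
}
```

For example, under this model a miss at byte offset 96 of a line (sector 3) gives `first_fetched_sector == 2`, so sectors 2 and 3 are requested, while a miss at offset 40 (sector 1) requests sectors 0 and 1.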

The GV100 L2 maintains the same 128 byte cache line with 4 x 32 byte sectors as previous NVIDIA GPU architectures. My guess is that the change to 64 byte requests on an L2 miss to device memory is due to the HBM2 interface. I have not had time to test pinned system memory or peer memory accesses via L2, but I believe these would keep the 32 byte request size to the final destination.