I’ve been trying to measure the memory throughput of misaligned memory accesses on Pascal and Kepler, both with cache enabled, with the following Kernel:
__global__ void array_copy (float *src, float *dest, int offset) {
const int index = blockDim.x * blockIdx.x + threadIdx.x + offset;
dest[index] = src[index];
}
which takes two properly aligned arrays src and dest, and copies the values starting with an offset. Both arrays have been initialized previously.
What you would expect is that offsets 0 and 32 (32 being 32 floats, so 128 bytes aka one cache line) show the highest throughput, since with those offsets, all accesses to the global memory are properly aligned with the cache lines. On Kepler, you can see exactly that; With those offsets, the throughput is significantly higher than with other offsets. On Pascal, however, those offsets are actually among the slowest, and the values in between follow no discernible logic.
Here are some graphs to visualize the data below: https://imgur.com/a/QnbNJ3f
Pascal:
offset: 0, throughput: 533.825989.2 GB/s
offset: 1, throughput: 534.513916.2 GB/s
offset: 2, throughput: 534.693115.2 GB/s
offset: 3, throughput: 534.695679.2 GB/s
offset: 4, throughput: 534.590637.2 GB/s
offset: 5, throughput: 534.467834.2 GB/s
offset: 6, throughput: 534.486633.2 GB/s
offset: 7, throughput: 534.593140.2 GB/s
offset: 8, throughput: 537.638916.2 GB/s
offset: 9, throughput: 536.225708.2 GB/s
offset: 10, throughput: 536.297852.2 GB/s
offset: 11, throughput: 536.257996.2 GB/s
offset: 12, throughput: 536.229187.2 GB/s
offset: 13, throughput: 533.259399.2 GB/s
offset: 14, throughput: 533.235229.2 GB/s
offset: 15, throughput: 533.214417.2 GB/s
offset: 16, throughput: 534.478210.2 GB/s
offset: 17, throughput: 535.073853.2 GB/s
offset: 18, throughput: 535.102661.2 GB/s
offset: 19, throughput: 535.060608.2 GB/s
offset: 20, throughput: 535.047180.2 GB/s
offset: 21, throughput: 535.672546.2 GB/s
offset: 22, throughput: 535.736755.2 GB/s
offset: 23, throughput: 535.696655.2 GB/s
offset: 24, throughput: 537.775024.2 GB/s
offset: 25, throughput: 535.394775.2 GB/s
offset: 26, throughput: 535.267944.2 GB/s
offset: 27, throughput: 535.284485.2 GB/s
offset: 28, throughput: 535.247864.2 GB/s
offset: 29, throughput: 533.771790.2 GB/s
offset: 30, throughput: 533.885498.2 GB/s
offset: 31, throughput: 533.861511.2 GB/s
offset: 32, throughput: 533.017395.2 GB/s
Kepler
offset: 0, throughput: 168.254944.2 GB/s
offset: 1, throughput: 161.967239.2 GB/s
offset: 2, throughput: 161.972595.2 GB/s
offset: 3, throughput: 161.960922.2 GB/s
offset: 4, throughput: 161.973801.2 GB/s
offset: 5, throughput: 161.960007.2 GB/s
offset: 6, throughput: 161.945938.2 GB/s
offset: 7, throughput: 161.979050.2 GB/s
offset: 8, throughput: 162.689224.2 GB/s
offset: 9, throughput: 161.394135.2 GB/s
offset: 10, throughput: 161.391571.2 GB/s
offset: 11, throughput: 161.402939.2 GB/s
offset: 12, throughput: 161.408340.2 GB/s
offset: 13, throughput: 161.390182.2 GB/s
offset: 14, throughput: 161.407379.2 GB/s
offset: 15, throughput: 161.380463.2 GB/s
offset: 16, throughput: 162.527832.2 GB/s
offset: 17, throughput: 161.549103.2 GB/s
offset: 18, throughput: 161.536224.2 GB/s
offset: 19, throughput: 161.542389.2 GB/s
offset: 20, throughput: 161.528473.2 GB/s
offset: 21, throughput: 161.556274.2 GB/s
offset: 22, throughput: 161.553864.2 GB/s
offset: 23, throughput: 161.544861.2 GB/s
offset: 24, throughput: 162.908646.2 GB/s
offset: 25, throughput: 161.572281.2 GB/s
offset: 26, throughput: 161.590210.2 GB/s
offset: 27, throughput: 161.559464.2 GB/s
offset: 28, throughput: 161.590729.2 GB/s
offset: 29, throughput: 161.580505.2 GB/s
offset: 30, throughput: 161.565475.2 GB/s
offset: 31, throughput: 161.572266.2 GB/s
offset: 32, throughput: 168.043701.2 GB/s
Does anyone know what’s happening here? Why are some misaligned offsets even faster than accessing the arrays with proper alignment? Is this some quirk of the Pascal architecture?
EDIT 1: Added Kepler numbers, added graphs for visualization
EDIT 2: More precise explanation of the problem