Unexpected Performance Results for Misaligned Memory Access on Pascal

I’ve been trying to measure the memory throughput of misaligned memory accesses on Pascal and Kepler, both with cache enabled, with the following Kernel:

__global__ void array_copy (float *src, float *dest, int offset) {
  const int index = blockDim.x * blockIdx.x + threadIdx.x + offset;
  dest[index] = src[index];
}

which takes two properly aligned arrays src and dest, and copies the values starting with an offset. Both arrays have been initialized previously.

What you would expect is that offsets 0 and 32 (32 being 32 floats, so 128 bytes aka one cache line) show the highest throughput, since with those offsets, all accesses to the global memory are properly aligned with the cache lines. On Kepler, you can see exactly that; With those offsets, the throughput is significantly higher than with other offsets. On Pascal, however, those offsets are actually among the slowest, and the values in between follow no discernible logic.

Here are some graphs to visualize the data below: https://imgur.com/a/QnbNJ3f

Pascal:

offset: 0, throughput: 533.825989.2 GB/s
offset: 1, throughput: 534.513916.2 GB/s
offset: 2, throughput: 534.693115.2 GB/s
offset: 3, throughput: 534.695679.2 GB/s
offset: 4, throughput: 534.590637.2 GB/s
offset: 5, throughput: 534.467834.2 GB/s
offset: 6, throughput: 534.486633.2 GB/s
offset: 7, throughput: 534.593140.2 GB/s
offset: 8, throughput: 537.638916.2 GB/s
offset: 9, throughput: 536.225708.2 GB/s
offset: 10, throughput: 536.297852.2 GB/s
offset: 11, throughput: 536.257996.2 GB/s
offset: 12, throughput: 536.229187.2 GB/s
offset: 13, throughput: 533.259399.2 GB/s
offset: 14, throughput: 533.235229.2 GB/s
offset: 15, throughput: 533.214417.2 GB/s
offset: 16, throughput: 534.478210.2 GB/s
offset: 17, throughput: 535.073853.2 GB/s
offset: 18, throughput: 535.102661.2 GB/s
offset: 19, throughput: 535.060608.2 GB/s
offset: 20, throughput: 535.047180.2 GB/s
offset: 21, throughput: 535.672546.2 GB/s
offset: 22, throughput: 535.736755.2 GB/s
offset: 23, throughput: 535.696655.2 GB/s
offset: 24, throughput: 537.775024.2 GB/s
offset: 25, throughput: 535.394775.2 GB/s
offset: 26, throughput: 535.267944.2 GB/s
offset: 27, throughput: 535.284485.2 GB/s
offset: 28, throughput: 535.247864.2 GB/s
offset: 29, throughput: 533.771790.2 GB/s
offset: 30, throughput: 533.885498.2 GB/s
offset: 31, throughput: 533.861511.2 GB/s
offset: 32, throughput: 533.017395.2 GB/s

Kepler

offset: 0, throughput: 168.254944.2 GB/s
offset: 1, throughput: 161.967239.2 GB/s
offset: 2, throughput: 161.972595.2 GB/s
offset: 3, throughput: 161.960922.2 GB/s
offset: 4, throughput: 161.973801.2 GB/s
offset: 5, throughput: 161.960007.2 GB/s
offset: 6, throughput: 161.945938.2 GB/s
offset: 7, throughput: 161.979050.2 GB/s
offset: 8, throughput: 162.689224.2 GB/s
offset: 9, throughput: 161.394135.2 GB/s
offset: 10, throughput: 161.391571.2 GB/s
offset: 11, throughput: 161.402939.2 GB/s
offset: 12, throughput: 161.408340.2 GB/s
offset: 13, throughput: 161.390182.2 GB/s
offset: 14, throughput: 161.407379.2 GB/s
offset: 15, throughput: 161.380463.2 GB/s
offset: 16, throughput: 162.527832.2 GB/s
offset: 17, throughput: 161.549103.2 GB/s
offset: 18, throughput: 161.536224.2 GB/s
offset: 19, throughput: 161.542389.2 GB/s
offset: 20, throughput: 161.528473.2 GB/s
offset: 21, throughput: 161.556274.2 GB/s
offset: 22, throughput: 161.553864.2 GB/s
offset: 23, throughput: 161.544861.2 GB/s
offset: 24, throughput: 162.908646.2 GB/s
offset: 25, throughput: 161.572281.2 GB/s
offset: 26, throughput: 161.590210.2 GB/s
offset: 27, throughput: 161.559464.2 GB/s
offset: 28, throughput: 161.590729.2 GB/s
offset: 29, throughput: 161.580505.2 GB/s
offset: 30, throughput: 161.565475.2 GB/s
offset: 31, throughput: 161.572266.2 GB/s
offset: 32, throughput: 168.043701.2 GB/s

Does anyone know what’s happening here? Why are some misaligned offsets even faster than accessing the arrays with proper alignment? Is this some quirk of the Pascal architecture?

EDIT 1: Added Kepler numbers, added graphs for visualization

EDIT 2: More precise explanation of the problem

I assume those are the Pascal numbers?

Post the Kepler numbers also, so we can see the differences (if any).

Define “all over the place” - looks pretty consistent to me…

I’ve added the Pascal numbers. Here are some graphs with explanations what exactly the problem is: https://imgur.com/a/QnbNJ3f

Rule of thumb based on 35 years of performance-oriented software engineering: Any performance difference under 2% should be treated as noise. Here you are trying to interpret performance differences < 1%.