How does "semi-misaligned" (a mixture of strided and misaligned) access to global memory affect performance?

I understand that misaligned access alone does not have a huge effect on performance for modern GPUs. However, I have come across sort of a case with “semi-misaligned” access, namely a mixture of strided and misaligned access, in the form of

__global__ void kernel(float* data, int* offsets_by_block, int n_contiguous, int stride) {
    int offset = offsets_by_block[blockIdx.x];
    int group_id = threadIdx.x / n_contiguous;
    int id_in_group = threadIdx.x % n_contiguous;
    float a = data[offset + (n_contiguous + stride) * group_id + id_in_group];
    // do something with a
}

Essentially, the threads are expected to access a chunk of data with an initial offset. The chunk of data is divided into groups with sizes of n_contiguous, and any two consecutive groups have a gap of stride between them. Now, assuming n_contigous >> size_of_half_warp, would such an access pattern have a significant impact on performance? Thanks!

strided and misaligned access, but otherwise coalesced in the sense that warps are requesting adjacent elements from memory, in the worst case (n_contiguous is small, say effectively 128 bytes or so) could result in a 2x reduction in performance of memory reads/writes, as compared to the perfect case. What I’m describing corresponds to slide 13 here. As the size gets larger than the size corresponding to a warp, then the performance approaches perfect performance asymptotically, corresponding to the series 1/2, 2/3, 3/4, 4/5, …

1 Like

Thanks a lot!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.