I understand that misaligned access alone does not have a huge effect on performance for modern GPUs. However, I have come across sort of a case with “semi-misaligned” access, namely a mixture of strided and misaligned access, in the form of
__global__ void kernel(float* data, int* offsets_by_block, int n_contiguous, int stride) {
int offset = offsets_by_block[blockIdx.x];
int group_id = threadIdx.x / n_contiguous;
int id_in_group = threadIdx.x % n_contiguous;
float a = data[offset + (n_contiguous + stride) * group_id + id_in_group];
// do something with a
}
Essentially, the threads are expected to access a chunk of data with an initial offset. The chunk of data is divided into groups with sizes of n_contiguous, and any two consecutive groups have a gap of stride between them. Now, assuming n_contigous >> size_of_half_warp, would such an access pattern have a significant impact on performance? Thanks!