Performance difference with different input data sizes (prime number vs. power of two)

We are currently implementing simple kernels using different programming languages (CUDA, HIP, OpenCL, SYCL) and want to compare the performance of different optimizations with each other (and across programming languages and hardware).
However, we encountered an interesting performance characteristic across all programming languages (and to some extent also on an AMD GPU). We use two different data set sizes: one where the size of the problem on the GPU is a prime number and one where it is a power of two.
(Note that the following code is our unoptimized baseline implementation.)

Kernel launch code:

// note: num_rows_reduced is either a prime number or a power of two
// -> if the data set is a power of two, num_rows_reduced is perfectly divisible by the block size
const std::size_t max_work_group_size = this->get_max_work_group_size();
const auto max_work_group_size_2D = static_cast<int>(std::sqrt(static_cast<real_type>(max_work_group_size)));
const dim3 block(max_work_group_size_2D, max_work_group_size_2D);
const dim3 grid(static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block.x))),
                static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block.y))));
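
For illustration, this is how the launch configuration differs between a power-of-two size and a size that is one larger (the concrete numbers 65536 and 65537 are made up for this example and are not our actual data set sizes); only in the first case is every launched block fully populated with valid work items:

#include <cmath>
#include <cstdio>

int main() {
    const int block = 32;  // sqrt(1024), i.e., the 2D block edge used on the A100
    for (const long long num_rows_reduced : { 65536LL, 65537LL }) {
        const int grid = static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block)));
        std::printf("num_rows_reduced = %lld -> grid = %d x %d -> %s\n",
                    num_rows_reduced, grid, grid,
                    num_rows_reduced % block == 0 ? "all blocks fully populated"
                                                  : "blocks at the border only partially populated");
    }
    // prints: 65536 -> 2048 x 2048 (fully populated), 65537 -> 2049 x 2049 (partially populated border)
}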

Actual compute kernel (most naive implementation we could come up with as a baseline):

// real_type is an alias for double
__global__ void device_kernel_assembly_linear(real_type *ret, const real_type *data_d, const unsigned long long num_rows, const unsigned long long num_features, const real_type *q, const real_type QA_cost, const real_type cost) {
    // one thread per (i, j) entry of the kernel matrix
    const unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned long long j = blockIdx.y * blockDim.y + threadIdx.y;

    // only the upper triangle (j >= i) is computed
    if (i < num_rows && j < num_rows && j >= i) {
        // dot product between data points i and j
        real_type temp{ 0.0 };
        for (unsigned long long dim = 0; dim < num_features; ++dim) {
            temp += data_d[i * num_features + dim] * data_d[j * num_features + dim];
        }
        temp = temp + QA_cost - q[i] - q[j];
        if (i == j) {
            // the cost is only added to the diagonal entries
            temp += cost;
        }

        // store the result in a packed upper-triangular layout
        ret[i * num_rows + j - i * (i + 1) / 2] = temp;
    }
}
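
For reference, the store index i * num_rows + j - i * (i + 1) / 2 packs the upper triangle into a contiguous array of num_rows * (num_rows + 1) / 2 entries; the following small host-side check (num_rows = 4 is chosen purely for illustration) shows the mapping:

#include <cstddef>
#include <iostream>

int main() {
    const std::size_t num_rows = 4;  // tiny size, only to illustrate the index formula
    for (std::size_t i = 0; i < num_rows; ++i) {
        for (std::size_t j = i; j < num_rows; ++j) {
            const std::size_t idx = i * num_rows + j - i * (i + 1) / 2;
            std::cout << "(" << i << ", " << j << ") -> " << idx << '\n';
        }
    }
    // the indices 0, 1, ..., num_rows * (num_rows + 1) / 2 - 1 are hit exactly once, in order
}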

If we now execute the above code on an A100 GPU, we get a runtime of 11.0s for our prime number data set and 30.9s for the power of two data set.
That’s nearly a factor of 3 difference, although the prime number data set has only 1 (!) data point more than the power of two data set.

Next, I looked at the profiling results:
cuda_kernel_matrix_linear_prime.ncu-rep (279.3 KB)
cuda_kernel_matrix_linear_power_of_two.ncu-rep (256.4 KB)
According to ncu, the factor of 3 stems from the L2 Cache ↔ L1/TEX Throughput (162.6 * 1e9 vs. 81.4 * 1e9).

What is the reason that the memory throughput is that much worse for one of the two data sets, even though they differ by only a single data point?

(On a side note: if we employ memory coalescing, these differences completely vanish; with this optimization, both data sets are equally fast.)
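
Roughly, the coalescing optimization stages tiles of data_d through shared memory so that consecutive threads read consecutive global addresses. The following is only a simplified sketch of that idea, not our actual optimized kernel: the q, QA_cost, and cost terms are omitted, and THREAD_BLOCK_SIZE is assumed to match the 2D block edge of 32 used above:

// simplified sketch of a coalesced variant (q, QA_cost, and cost omitted)
using real_type = double;
#define THREAD_BLOCK_SIZE 32  // assumed to match the 2D block edge used above

__global__ void device_kernel_assembly_linear_coalesced(real_type *ret, const real_type *data_d, const unsigned long long num_rows, const unsigned long long num_features) {
    // +1 padding column to avoid shared memory bank conflicts
    __shared__ real_type tile_i[THREAD_BLOCK_SIZE][THREAD_BLOCK_SIZE + 1];
    __shared__ real_type tile_j[THREAD_BLOCK_SIZE][THREAD_BLOCK_SIZE + 1];

    const unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned long long j = blockIdx.y * blockDim.y + threadIdx.y;

    real_type temp{ 0.0 };
    for (unsigned long long dim = 0; dim < num_features; dim += THREAD_BLOCK_SIZE) {
        // cooperative tile load: threadIdx.x walks over consecutive features of one row,
        // so each warp reads consecutive global addresses (coalesced)
        const unsigned long long row_i = blockIdx.x * blockDim.x + threadIdx.y;
        const unsigned long long row_j = blockIdx.y * blockDim.y + threadIdx.y;
        const unsigned long long feature = dim + threadIdx.x;
        tile_i[threadIdx.y][threadIdx.x] = (row_i < num_rows && feature < num_features) ? data_d[row_i * num_features + feature] : real_type{ 0.0 };
        tile_j[threadIdx.y][threadIdx.x] = (row_j < num_rows && feature < num_features) ? data_d[row_j * num_features + feature] : real_type{ 0.0 };
        __syncthreads();

        // the partial dot product now only reads shared memory
        for (unsigned long long block_dim = 0; block_dim < THREAD_BLOCK_SIZE; ++block_dim) {
            temp += tile_i[threadIdx.x][block_dim] * tile_j[threadIdx.y][block_dim];
        }
        __syncthreads();
    }

    if (i < num_rows && j < num_rows && j >= i) {
        // same packed upper-triangular store as in the baseline kernel
        ret[i * num_rows + j - i * (i + 1) / 2] = temp;
    }
}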

Because memory coalescing fixes the issue (and is a big part of getting good performance on a GPU), it's likely that you have uncoalesced memory accesses in the power-of-two case. Without doing a detailed analysis of both the source code and the reports, it's hard to say exactly why, but it isn't surprising that a small change in data size, layout, or access patterns can have a large impact on performance.

The thing is that, according to the profiling results, the prime number data set has 5% uncoalesced accesses while the power of two data set has 8% (68904026112 vs. 103129546752 excessive sectors, which is only a factor of roughly 1.5).
Is it possible that this 3-percentage-point increase results in a factor of 3 in the total runtime?