Performance difference with different input data sizes (prime number vs. power of two)

We are currently implementing simple kernels using different programming languages (CUDA, HIP, OpenCL, SYCL) and want to compare the performance of different optimizations with each other (and across programming languages and hardware).
However, we encountered an interesting performance characteristic across all programming languages (and to some extent also on an AMD GPU). We use two different data set sizes: one where the size of the problem on the GPU is a prime number and one where it is a power of two.
(Note that the following code is our unoptimized baseline implementation.)

Kernel launch code:

// note: num_rows_reduced is either a prime number or a power of two
// -> if the data set is a power of two, num_rows_reduced is perfectly divisible by the block size
const std::size_t max_work_group_size = this->get_max_work_group_size();
const auto max_work_group_size_2D = static_cast<int>(std::sqrt(static_cast<real_type>(max_work_group_size)));
const dim3 block(max_work_group_size_2D, max_work_group_size_2D);
const dim3 grid(static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block.x))),
                static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block.y))));
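
For illustration, this is how the launch configuration differs between a power-of-two size and a size that is one larger (the concrete numbers 65536 and 65537 are made up for this example and are not our actual data set sizes); only in the first case is every launched block fully populated with valid work items:

#include <cmath>
#include <cstdio>

int main() {
    const int block = 32;  // sqrt(1024), i.e., the 2D block edge used on the A100
    for (const long long num_rows_reduced : { 65536LL, 65537LL }) {
        const int grid = static_cast<int>(std::ceil(static_cast<double>(num_rows_reduced) / static_cast<double>(block)));
        std::printf("num_rows_reduced = %lld -> grid = %d x %d -> %s\n",
                    num_rows_reduced, grid, grid,
                    num_rows_reduced % block == 0 ? "all blocks fully populated"
                                                  : "blocks at the border only partially populated");
    }
    // prints: 65536 -> 2048 x 2048 (fully populated), 65537 -> 2049 x 2049 (partially populated border)
}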

Actual compute kernel (most naive implementation we could come up with as a baseline):

// real_type is an alias for double
__global__ void device_kernel_assembly_linear(real_type *ret, const real_type *data_d, const unsigned long long num_rows, const unsigned long long num_features, const real_type *q, const real_type QA_cost, const real_type cost) {
    // one thread per (i, j) entry of the kernel matrix
    const unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned long long j = blockIdx.y * blockDim.y + threadIdx.y;

    // only the upper triangle (j >= i) is computed
    if (i < num_rows && j < num_rows && j >= i) {
        // dot product between data points i and j
        real_type temp{ 0.0 };
        for (unsigned long long dim = 0; dim < num_features; ++dim) {
            temp += data_d[i * num_features + dim] * data_d[j * num_features + dim];
        }
        temp = temp + QA_cost - q[i] - q[j];
        if (i == j) {
            // the cost is only added to the diagonal entries
            temp += cost;
        }

        // store the result in a packed upper-triangular layout
        ret[i * num_rows + j - i * (i + 1) / 2] = temp;
    }
}
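
For reference, the store index i * num_rows + j - i * (i + 1) / 2 packs the upper triangle into a contiguous array of num_rows * (num_rows + 1) / 2 entries; the following small host-side check (num_rows = 4 is chosen purely for illustration) shows the mapping:

#include <cstddef>
#include <iostream>

int main() {
    const std::size_t num_rows = 4;  // tiny size, only to illustrate the index formula
    for (std::size_t i = 0; i < num_rows; ++i) {
        for (std::size_t j = i; j < num_rows; ++j) {
            const std::size_t idx = i * num_rows + j - i * (i + 1) / 2;
            std::cout << "(" << i << ", " << j << ") -> " << idx << '\n';
        }
    }
    // the indices 0, 1, ..., num_rows * (num_rows + 1) / 2 - 1 are hit exactly once, in order
}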

If we now execute the above code on an A100 GPU, we get a runtime of 11.0s for our prime number data set and 30.9s for the power of two data set.
That’s nearly a factor of 3 difference, although the prime number data set has only 1 (!) data point more than the power of two data set.

Next, I looked at the profiling results:
cuda_kernel_matrix_linear_prime.ncu-rep (279.3 KB)
cuda_kernel_matrix_linear_power_of_two.ncu-rep (256.4 KB)
According to ncu, the factor of 3 stems from the L2 Cache ↔ L1/TEX Throughput (162.6 * 1e9 vs. 81.4 * 1e9).

What is the reason that the memory throughput is that much worse for one of the two data sets, even though they differ by only a single data point?

(On a side note: if we employ memory coalescing, these differences completely vanish; with this optimization, both data sets are equally fast.)
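
Roughly, the coalescing optimization stages tiles of data_d through shared memory so that consecutive threads read consecutive global addresses. The following is only a simplified sketch of that idea, not our actual optimized kernel: the q, QA_cost, and cost terms are omitted, and THREAD_BLOCK_SIZE is assumed to match the 2D block edge of 32 used above:

// simplified sketch of a coalesced variant (q, QA_cost, and cost omitted)
using real_type = double;
#define THREAD_BLOCK_SIZE 32  // assumed to match the 2D block edge used above

__global__ void device_kernel_assembly_linear_coalesced(real_type *ret, const real_type *data_d, const unsigned long long num_rows, const unsigned long long num_features) {
    // +1 padding column to avoid shared memory bank conflicts
    __shared__ real_type tile_i[THREAD_BLOCK_SIZE][THREAD_BLOCK_SIZE + 1];
    __shared__ real_type tile_j[THREAD_BLOCK_SIZE][THREAD_BLOCK_SIZE + 1];

    const unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned long long j = blockIdx.y * blockDim.y + threadIdx.y;

    real_type temp{ 0.0 };
    for (unsigned long long dim = 0; dim < num_features; dim += THREAD_BLOCK_SIZE) {
        // cooperative tile load: threadIdx.x walks over consecutive features of one row,
        // so each warp reads consecutive global addresses (coalesced)
        const unsigned long long row_i = blockIdx.x * blockDim.x + threadIdx.y;
        const unsigned long long row_j = blockIdx.y * blockDim.y + threadIdx.y;
        const unsigned long long feature = dim + threadIdx.x;
        tile_i[threadIdx.y][threadIdx.x] = (row_i < num_rows && feature < num_features) ? data_d[row_i * num_features + feature] : real_type{ 0.0 };
        tile_j[threadIdx.y][threadIdx.x] = (row_j < num_rows && feature < num_features) ? data_d[row_j * num_features + feature] : real_type{ 0.0 };
        __syncthreads();

        // the partial dot product now only reads shared memory
        for (unsigned long long block_dim = 0; block_dim < THREAD_BLOCK_SIZE; ++block_dim) {
            temp += tile_i[threadIdx.x][block_dim] * tile_j[threadIdx.y][block_dim];
        }
        __syncthreads();
    }

    if (i < num_rows && j < num_rows && j >= i) {
        // same packed upper-triangular store as in the baseline kernel
        ret[i * num_rows + j - i * (i + 1) / 2] = temp;
    }
}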

Because memory coalescing fixes the issue (and is a big part of getting good performance on a GPU), it's likely that you have uncoalesced memory accesses in the power-of-two case. Without doing a detailed analysis of both the source code and the reports, it's hard to say exactly why, but it isn't surprising that a small change in data size, layout, or access patterns can have a large impact on performance.

The thing is that, according to the profiling results, the prime number data set has 5% uncoalesced accesses while the power of two data set has 8% (68904026112 vs. 103129546752 excessive sectors, which is only a factor of roughly 1.5).
Is it possible that this 3-percentage-point increase results in a factor of 3 in the total runtime?