Profiling performance and energy consumption of the basic operations

Hi,
I am familiar with GPU architecture, but I have barely programmed GPUs using CUDA.
I would like to profile the energy consumption and latency of single operations such as addition, multiplication, and sin (both the standard function and the intrinsic).
For that I have kernels like:

__global__ void sin_intrinsic32_kernel(float* a, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = __sinf(a[idx]);  // CUDA's intrinsic sine function
    }
}
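
The two-input arithmetic kernels follow the same pattern; for example, the 32-bit integer addition kernel referenced in main below is essentially:

__global__ void add32_kernel(int* a, int* b, int* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];  // one 32-bit integer addition per element
    }
}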

Such kernels are run from a helper function like this:

// Helper to measure kernel execution time and energy consumption for 2-argument kernels
template <typename T>
void run_and_measure(void (*kernel)(T*, T*, int), T* a, T* c, int N, int num_blocks, nvmlDevice_t device,
                     float& avg_time_per_block_ns, float& avg_energy_per_kernel_mj) {
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Warm-up phase
    kernel<<<blocksPerGrid, threadsPerBlock>>>(a, c, N);
    cudaDeviceSynchronize();

    // Create CUDA events to measure time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Total time accumulator
    float total_time_ms = 0;
    unsigned long long total_energy = 0;

    // Run the kernel 1K times and average
    const int runs = 1000;
    for (int i = 0; i < runs; ++i) {
        // Read the GPU's cumulative energy counter (millijoules) before the kernel
        unsigned long long energy_before = 0;
        nvmlDeviceGetTotalEnergyConsumption(device, &energy_before);

        // Launch the kernel
        cudaEventRecord(start);
        kernel<<<blocksPerGrid, threadsPerBlock>>>(a, c, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        // Measure energy after the kernel
        unsigned long long energy_after = 0;
        nvmlDeviceGetTotalEnergyConsumption(device, &energy_after);
        total_energy += (energy_after - energy_before);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start, stop);
        total_time_ms += milliseconds;

        cudaDeviceSynchronize();  // Ensure the kernel is done
    }

    // Average kernel time divided by the number of blocks (ms * 1e6 -> nanoseconds)
    avg_time_per_block_ns = (total_time_ms / runs) * 1e6 / num_blocks;

    // Average energy per kernel launch; nvmlDeviceGetTotalEnergyConsumption reports millijoules
    avg_energy_per_kernel_mj = static_cast<float>(total_energy) / runs;

    // Destroy CUDA events
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
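
For the two-input kernels (add, mul, div) there is an analogous overload; only its signature is shown here, since the body differs just in the kernel launch arguments:

template <typename T>
void run_and_measure(void (*kernel)(T*, T*, T*, int), T* a, T* b, T* c, int N, int num_blocks,
                     nvmlDevice_t device, float& avg_time_per_block_ns, float& avg_energy_per_kernel_mj);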

My main looks like this:

int main() {
    const int N = 1 << 20;  // 1M elements
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Allocate memory for different types
    int *a32, *b32, *c32;
    short *a16, *b16, *c16;
    char *a8, *b8, *c8;
    float *af32, *bf32, *cf32;

    // Allocate unified memory (accessible by both CPU and GPU)
    cudaMallocManaged(&a32, N * sizeof(int));
    cudaMallocManaged(&b32, N * sizeof(int));
    cudaMallocManaged(&c32, N * sizeof(int));
    cudaMallocManaged(&a16, N * sizeof(short));
    cudaMallocManaged(&b16, N * sizeof(short));
    cudaMallocManaged(&c16, N * sizeof(short));
    cudaMallocManaged(&a8, N * sizeof(char));
    cudaMallocManaged(&b8, N * sizeof(char));
    cudaMallocManaged(&c8, N * sizeof(char));
    cudaMallocManaged(&af32, N * sizeof(float));
    cudaMallocManaged(&bf32, N * sizeof(float));
    cudaMallocManaged(&cf32, N * sizeof(float));

    // Initialize input arrays
    for (int i = 0; i < N; ++i) {
        a32[i] = i;
        b32[i] = i + 1;
        a16[i] = static_cast<short>(i);
        b16[i] = static_cast<short>(i + 1);
        a8[i] = static_cast<char>(i);
        b8[i] = static_cast<char>(i + 1);
        af32[i] = static_cast<float>(i);
        bf32[i] = static_cast<float>(i + 1);
    }

    // Energy measurement setup
    nvmlDevice_t device;
    start_energy_measurement(device);

    // Arrays to store time and energy results
    float time_ns[12], energy_mj[12]; // We have 12 operations to measure (3 add, 3 mul, 3 div, 3 sin)

    // ------------- Measure and store results -------------
    // ADDITION
    run_and_measure(add32_kernel, a32, b32, c32, N, blocksPerGrid, device, time_ns[0], energy_mj[0]);
    // ... (remaining measurements elided)

    return 0;
}
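
Here start_energy_measurement is a small NVML helper that, roughly, just initializes the library and fetches the device handle (error checking omitted):

// minimal sketch of the helper used above (assumed implementation)
void start_energy_measurement(nvmlDevice_t& device) {
    nvmlInit();                              // initialize NVML
    nvmlDeviceGetHandleByIndex(0, &device);  // handle for GPU 0
}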

However, I do not trust the results that I get:

Execution Time (ns):

            Operation         32-bit         16-bit          8-bit          Float
             Addition        9.16976         5.5709        3.89919        9.17013
       Multiplication        9.20029        5.58998        3.89656        9.16062
             Division        9.13221        6.33858        4.75525        9.07272
      Sine (standard)              -              -              -            9.9
     Sine (intrinsic)              -              -              -        6.51939

Energy Consumption (mJ):

            Operation         32-bit         16-bit          8-bit          Float
             Addition            9.9        6.51939          3.686          2.961
       Multiplication           3.01          3.558           5.59          3.499
             Division          3.766          3.949          2.613          2.888
      Sine (standard)              -              -              -          4.154
     Sine (intrinsic)              -              -              -          3.112

Can you please comment on my approach? Is it even necessary to do it like this to get the energy and latency numbers?
If so, what am I doing incorrectly that gives larger latency figures for addition than for sine?
Also, the energy figures for addition (32-bit and 16-bit) seem incorrect to me; they are 2-3x larger than the rest!

Thank you for your comments in advance.

These energy measurements will be totally dominated by the energy required to access memory, with any computational activity contributing at or below measurement noise level. I doubt that any power / energy measurements from NVML are accurate to better than 5%, if that. NVML is based on simple mass-produced sensors affected by manufacturing variations.
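
To put a rough number on it using the kernels shown above (assuming perfect coalescing and no cache reuse): each element processed by a two-input kernel moves 4 B (load a) + 4 B (load b) + 4 B (store c) = 12 B of DRAM traffic for a single arithmetic instruction. So what is being measured is essentially the cost of moving data, not the cost of the add, multiply, or divide.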

Single-precision division requires multiple FMA instructions and a MUFU instruction (reciprocal), so we would expect its energy requirements to be roughly a decimal order of magnitude higher than those of a single-precision multiply. But that is not what we are seeing in the data shown.
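
One way to see this for yourself is to compile a trivial division kernel and dump the generated machine code (a quick sketch; sm_86 chosen here as an arbitrary Ampere target):

// div_test.cu
__global__ void div_kernel(const float* a, const float* b, float* c) {
    c[0] = a[0] / b[0];  // with the default -prec-div=true this compiles to MUFU.RCP plus FMA-based refinement
}
// Build and inspect the SASS:
//   nvcc -arch=sm_86 -cubin -o div_test.cubin div_test.cu
//   cuobjdump --dump-sass div_test.cubin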

Thank you for the quick response.
I guessed that. However, using Nsight I could not figure out how to separate the memory-access latency from the computation time, for instance. Any hints on that?

IMHO, Nsight is not intended for such investigations and is probably unsuitable. For latency investigations, people typically use cleverly constructed microbenchmarks, and for best results they often control execution tightly by writing the core of that code directly in machine language. That is challenging to do for GPUs, as NVIDIA does not provide public tools for coding at the SASS level.

Isolating components of power draw / energy consumption is even more challenging. I do not have any experience with that but would assume that it involves calibrated precision measurement devices instead of internal sensors, and/or simulations based on physical models if one has access to the internal design details of components such as adders, multipliers, etc.

A detailed search of the literature may be in order. What is the ultimate goal of this investigation?

Thanks for the explanation.
I will look up the literature.
The idea is to research how the execution of an MLP on a GPU compares with the execution of a KAN on a GPU, as well as on alternative architectures/technologies that are not CMOS-based. So it is a feasibility/practicality study.

Regarding the performance of single instructions, you will probably fare better using the theoretical throughput numbers for arithmetic instructions, e.g. as shown in the CUDA C++ Programming Guide (link to chapter 5.4.1).

Example: If you use an RTX 3060 (Ampere generation) running at 1320 MHz, each multiprocessor can do 128 single-precision floating-point additions per cycle. If your SMs run 6 blocks each, then they need on average 35.5 ps (picoseconds) per addition per block (the 28 SMs run 168 blocks in parallel at those throughput numbers). That is around 258 times faster than the numbers you measured.
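
Spelled out as a quick back-of-the-envelope check (plain C++, using the assumed RTX 3060 numbers above):

#include <cstdio>

int main() {
    // Assumed RTX 3060 figures: 1320 MHz, 128 FP32 additions per cycle per SM,
    // 6 resident blocks (of 256 threads) per SM.
    const double clock_hz          = 1320e6;
    const double adds_per_cycle_sm = 128.0;
    const double blocks_per_sm     = 6.0;

    double adds_per_cycle_per_block = adds_per_cycle_sm / blocks_per_sm;             // ~21.3
    double ps_per_add_per_block     = 1e12 / (clock_hz * adds_per_cycle_per_block);  // ~35.5 ps
    double speedup_vs_measured      = 9.16976e-9 / (ps_per_add_per_block * 1e-12);   // ~258x vs. the 9.17 ns measured

    printf("%.1f ps per addition per block, %.0fx faster than measured\n",
           ps_per_add_per_block, speedup_vs_measured);
    return 0;
}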

Do not forget the Tensor Cores (if they can speed up KANs by using small matrix-matrix multiplications). They exceed the throughput numbers in that table and can run in parallel to the 'normal' arithmetic computations. With the right coefficients, nearly all functions can be mapped to additions and multiplications. E.g. you can build matrices from vectors of powers (1, x, x², x³, x⁴, ...) and multiply them with matrices of coefficient vectors to calculate Taylor approximations of several functions at the same time, for many inputs at once, on the Tensor Cores.
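
As a minimal host-side sketch of that mapping (plain C++ with sin as the example function; on the GPU the same N-by-K times K-by-M product is what the Tensor Cores would execute in FP16/TF32):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // P[i][k] = x_i^k (powers), C[k][j] = k-th Taylor coefficient of function j, R = P * C.
    const int N = 4, K = 8, M = 1;  // 4 inputs, 8 series terms, 1 function (sin)
    std::vector<float> x = {0.1f, 0.5f, 1.0f, 1.5f};
    std::vector<float> P(N * K), C(K * M, 0.0f), R(N * M, 0.0f);

    for (int i = 0; i < N; ++i)                       // powers matrix
        for (int k = 0; k < K; ++k)
            P[i * K + k] = std::pow(x[i], static_cast<float>(k));

    // Taylor series of sin(x): x - x^3/3! + x^5/5! - x^7/7!
    C[1] = 1.0f; C[3] = -1.0f / 6.0f; C[5] = 1.0f / 120.0f; C[7] = -1.0f / 5040.0f;

    for (int i = 0; i < N; ++i)                       // R = P * C (the part a Tensor Core would do)
        for (int j = 0; j < M; ++j)
            for (int k = 0; k < K; ++k)
                R[i * M + j] += P[i * K + k] * C[k * M + j];

    for (int i = 0; i < N; ++i)
        printf("x = %4.2f  approx = %f  std::sin = %f\n", x[i], R[i], std::sin(x[i]));
    return 0;
}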

Otherwise look up open-source GPU microbenchmark code, like sjfeng1999/gpu-arch-microbenchmark ("Dissecting NVIDIA GPU Architecture") or passlab/CUDAMicroBench on GitHub.

But be aware that it is not easy to isolate the running time of single instructions without effects that either speed things up (e.g. if you do not use the results, the compiler will remove your instructions), slow them down (less than full occupancy, latency issues, memory accesses), or make the measurement inaccurate (which timer to use). Just for your feasibility study, the theoretical maximum numbers minus ~20% may be the better choice.
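
For illustration, latency microbenchmarks typically time a long dependent chain whose result feeds a store the compiler cannot prove dead, roughly like this (a sketch, not a validated benchmark):

__global__ void fadd_latency_kernel(float* out, float x, float y, int iters) {
    float acc = x;
    for (int i = 0; i < iters; ++i) {
        acc = acc + y;        // each add depends on the previous one -> exposes latency, not throughput
    }
    if (acc == 12345.0f)      // data-dependent store keeps the chain from being optimized away
        *out = acc;
}

Timing the kernel for two different iteration counts and dividing the difference in time by the difference in iterations removes most of the launch overhead from the estimate.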

Instruction throughput is only half the battle; you also have to think about the memory hierarchy, which is often the limiting factor for GPU, CPU, or FPGA/ASIC code.

If you have more complete implementations (instead of single instructions), it will make more sense to measure performance and power.

@njuffa has a lot of experience implementing fast approximations to functions and has shared some of them, including trigonometric functions like sin. Search through the forum. Depending on the domain (parameter interval) and the required accuracy, such approximations can be faster than the built-in ones.