Why is using [&] 10% slower than [=] when wrapping a kernel function call in a lambda expression?

Hi guys,
I am trying to wrap a kernel function call in a lambda expression, but I found that using [&] is 10% slower than [=] on my computer:

auto launch_kernel = [=]() {
    traverseArray<<<gridSize, blockSize>>>(d_array, width, height);
};

The measurement demo code is at Cuda_lambda_kernel.

First, my initial observation is the opposite of yours (you: [&]-capture is slower; me: [=]-capture is slower).

Second, when I apply what I consider to be basic benchmarking practice, I observe no difference between the two cases:

$ cat t2186.cu
#include <cuda_runtime.h>
#include <iostream>
#include <utility>   // std::forward (used by CudaKernelExecutionTime)

__global__ void traverseArray(int *d_array, int width, int height) {
    auto x = blockIdx.x * blockDim.x + threadIdx.x;
    auto y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        d_array[y * width + x] *= 2;
    }
}

struct CudaKernelExecutionTime {
    template<typename F, typename... Args>
    float operator()(F &&f, Args &&... args) {
        // Start timing
        cudaEvent_t start, stop;
        float time;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Insert start event
        cudaEventRecord(start, nullptr);
        // call kernel
        f(std::forward<Args>(args)...);
        // insert stop event
        cudaEventRecord(stop, nullptr);

        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);

        // Free resources
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        return time;
    }
};

int main() {
    int width = 10;
    int height = 10;
    int size = width * height;
    int *h_array = new int[size];
    int *d_array;
    cudaMalloc((void **) &d_array, size * sizeof(int));

    // Initialize array on host
    for (int i = 0; i < size; i++) {
        h_array[i] = i;
    }

    // Copy array from host to device
    cudaMemcpy(d_array, h_array, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel to traverse array
    dim3 blockSize(8, 8);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y);


    // Run kernel
#ifdef USE_EQ
    auto launch_kernel = [=]() {
#else
    auto launch_kernel = [&]() {
#endif
        traverseArray<<<gridSize, blockSize>>>(d_array, width, height);
    };
    CudaKernelExecutionTime timer;
#ifdef USE_WARMUP
    launch_kernel();
    cudaDeviceSynchronize();
#endif
    auto time = timer(launch_kernel);

    // Print time
    std::cout << "Kernel execution time: " << time << "ms"
              << std::endl;


    // Copy array back from device to host
    cudaMemcpy(h_array, d_array, size * sizeof(int), cudaMemcpyDeviceToHost);

    // Print modified array
    for (int i = 0; i < 2; i++) {
        printf("Array element %d: %d\n", i, h_array[i]);
    }

    // Free device memory
    cudaFree(d_array);
    delete[] h_array;

    return 0;
}
$ nvcc -o t2186 t2186.cu
$ ./t2186
Kernel execution time: 0.034304ms
Array element 0: 0
Array element 1: 2
$ nvcc -o t2186 t2186.cu -DUSE_EQ
$ ./t2186
Kernel execution time: 0.038016ms
Array element 0: 0
Array element 1: 2
$ nvcc -o t2186 t2186.cu -DUSE_WARMUP
$ ./t2186
Kernel execution time: 0.011808ms
Array element 0: 0
Array element 1: 4
$ nvcc -o t2186 t2186.cu -DUSE_WARMUP -DUSE_EQ
$ ./t2186
Kernel execution time: 0.011776ms
Array element 0: 0
Array element 1: 4
$

CUDA 11.4, Tesla V100

I also note that when I run the profiler on your original test case, the kernel execution time on my system is ~3.6us, regardless of which capture/launch variant is used. You’re timing an incredibly short kernel. Whatever overhead or overhead variance there may be here will likely become insignificant for a longer-duration kernel.
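
A common way to reduce the influence of launch overhead when timing a very short kernel (a sketch of standard practice on my part, not something taken from the program above) is to launch the kernel many times between one pair of events and divide by the repetition count:

// Sketch: amortize launch overhead over many back-to-back launches.
// Assumes the traverseArray kernel, d_array, width, height, gridSize and
// blockSize from the program above; error checking omitted for brevity.
// Note that the kernel's side effects accumulate across repetitions.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const int reps = 1000;
cudaEventRecord(start);
for (int i = 0; i < reps; ++i) {
    traverseArray<<<gridSize, blockSize>>>(d_array, width, height);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "average per launch: " << (ms / reps) << " ms" << std::endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);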


Thanks sincerely for your detailed explanation and helpful code.
As a CUDA beginner, I would like to know how to use “the profiler” to measure a kernel’s speed, since you mentioned “when I run the profiler, the kernel execution time on my system is ~3.6us”.
My only knowledge of profiling is the event APIs, and I want to learn a more professional way.

The profiler you use should be chosen based on the GPU you are running on. Eventually, as older GPUs become obsolete, all profiling is expected to be done using Nsight Systems, and this blog can help you get started. If you are running on a Pascal or older GPU, you should instead use nvprof, a legacy profiler at this point. The blog also gives an example of nvprof usage.
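
As a concrete starting point (these command lines are my own illustration, not taken from the blog; report names and defaults vary by version), you could profile the test program above like this and read the kernel time from the printed kernel summary:

$ nsys profile --stats=true ./t2186    # recent GPUs: Nsight Systems, kernel times in the GPU kernel summary
$ nvprof ./t2186                       # Pascal and older GPUs: legacy profiler, per-kernel times printed directly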


While it does not offer all the benefits of the latest Nsight Compute, the earlier version 2019.5.0, available here, is the last Linux version that supports Pascal (compute capability 6.x).

I heard that using the event APIs plus a warm-up is the most accurate way to measure the running time of a kernel function. Is that true?

I consider the profiler to be the best tool for measuring the running time of a kernel, and there are known situations where cudaEvent-based timing can be tricky to use for timing an individual kernel.

You can find many forum postings where people discuss the trade-offs of the various CUDA timing methodologies. Reasonably good timing can be done, if properly applied, with a host-based timing method, a cudaEvent-based method, or the profilers. For most of my own use cases I am only interested in comparisons, so as long as you properly apply one methodology and stick with it, you can usually do good comparison work.
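
For reference, here is a minimal host-based timing sketch (my illustration, reusing the traverseArray kernel and launch configuration from earlier in this thread; <chrono> and <iostream> are assumed to be included). The cudaDeviceSynchronize() after the launch is what makes the host clock cover the full kernel duration:

// Host-based timing around a single launch of traverseArray.
// Assumes d_array, width, height, gridSize, blockSize as defined above.
cudaDeviceSynchronize();                               // drain any prior work first
auto t0 = std::chrono::steady_clock::now();
traverseArray<<<gridSize, blockSize>>>(d_array, width, height);
cudaDeviceSynchronize();                               // wait for the kernel to finish
auto t1 = std::chrono::steady_clock::now();
float ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
std::cout << "host-timed kernel duration: " << ms << " ms" << std::endl;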

For my own uses, discussions of whether or not a CUDA kernel actually took 10 microseconds as reported by host-based timing, or 6 microseconds as reported by cudaEvent-based timing, or 5 microseconds as reported by the profiler, don’t lead me to useful insight.


I need to elaborate on my original question: I wanted to compare whether 2D array indexing would seriously impact performance, but I failed to allocate a 2D array using cudaMalloc, as shown in CUDA_2d_cudaMalloc.

After reading some forum posts, like

we finally came to the conclusion that CUDA does not support indexing a 2D array C++-style, as in “d_array[i][j]”.

I am still really grateful for @Robert_Crovella's guidance, which has inspired me a lot.

That is incorrect. CUDA is based on C++ and therefore makes exactly the same array addressing available that is found in C++11.

If by d_array[i][j] you mean the kind of data structure that is not actually a true 2D array but instead an array of pointers to 1D arrays, you can create that in CUDA as well, but I would not recommend it for performance reasons. Here is a worked example.

#include <cstdio>
#include <cstdlib>
#include <complex>
#include "cuComplex.h"

#define N  (2)
#define M  (3)

typedef std::complex<float> T;

__global__ void print_device_matrix (cuComplex** mat)
{
    printf ("matrix on device:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", cuCrealf (mat[i][j]), cuCimagf (mat[i][j]));
        }
        printf ("\n");
    }
}

int main (void)
{
    /* allocate host "matrix" */
    T **mat = (T**)malloc (N * sizeof (mat[0]));
    for (int i = 0; i < N; i++) {
        mat[i] = (T *)malloc (M * sizeof (mat[0][0]));
    }
    
    /* fill in host "matrix" */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            mat[i][j] = T (float(i)+1, float(j)+1);
        }
    }

    /* print host "matrix" */
    printf ("matrix on host:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", real(mat[i][j]), imag(mat[i][j]));
        }
        printf ("\n");
    }

    /* allocate device "matrix" */
    T **tmp = (T**)malloc (N * sizeof (tmp[0]));
    for (int i = 0; i < N; i++) {
        cudaMalloc ((void **)&tmp[i], M * sizeof (tmp[0][0]));
    }
    cuComplex **matD = 0;
    cudaMalloc ((void **)&matD, N * sizeof (matD[0]));

    /* copy "matrix" from host to device */
    cudaMemcpy (matD, tmp, N * sizeof (matD[0]), cudaMemcpyHostToDevice);
    for (int i = 0; i < N; i++) {
        cudaMemcpy (tmp[i], mat[i], M * sizeof (matD[0][0]), cudaMemcpyHostToDevice);
    }
    free (tmp);

    /* print device "matrix" */
    print_device_matrix<<<1,1>>> (matD);

    /* free host "matrix" */
    for (int i = 0; i < N; i++) {
        free (mat[i]);
    }
    free (mat);
    
    /* free device "matrix" */
    tmp = (T**)malloc (N * sizeof (tmp[0]));
    cudaMemcpy (tmp, matD, N * sizeof (matD[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        cudaFree (tmp[i]);
    }
    free (tmp);
    cudaFree (matD);

    return EXIT_SUCCESS;
}
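
For contrast, here is a sketch of my own (not part of the worked example above) showing that d_array[i][j] indexing is also possible with a single flat cudaMalloc allocation when the row width is a compile-time constant, by using a pointer-to-array; this avoids the extra pointer indirection of the array-of-pointers layout:

#include <cuda_runtime.h>

#define WIDTH (10)

/* true 2D indexing into a single flat allocation via a pointer-to-array */
__global__ void doubleAll (int (*d_array)[WIDTH], int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < WIDTH && y < height) {
        d_array[y][x] *= 2;   /* one memory access, no pointer chase */
    }
}

int main (void)
{
    const int height = 10;
    int (*d_array)[WIDTH];   /* pointer to rows of WIDTH ints */
    cudaMalloc ((void **)&d_array, height * WIDTH * sizeof (int));
    cudaMemset (d_array, 0, height * WIDTH * sizeof (int));

    dim3 blockSize (8, 8);
    dim3 gridSize ((WIDTH + blockSize.x - 1) / blockSize.x,
                   (height + blockSize.y - 1) / blockSize.y);
    doubleAll<<<gridSize, blockSize>>> (d_array, height);
    cudaDeviceSynchronize ();

    cudaFree (d_array);
    return 0;
}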

I appreciate the code. I get it now! 😀