Why does the persistent thread approach reduce kernel performance?

Consider the following ways of performing fill on a float array.

  • Method 1: Each thread performs a single set operation.
  • Method 2: On top of method 1, replace scalar store with vector store.
  • Method 3: Launch a minimum number of threads that saturate the GPU. Each thread performs as many set operations as necessary.
  • Method 4: On top of method 3, replace scalar store with vector store.
  • Method 5: Use CUDA driver API cuMemsetD32 for 32-bit set.
  • Method 6: Use thrust API.

Here’s the implementation:

// nvcc Memset.cu -lcuda -o Memset

#include <cstddef>  // std::size_t
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

// fill1: one element per thread, scalar store
__global__ void fill1(float* a, size_t num, float value) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gtid < num) {
        a[gtid] = value;
    }
}

// fill2: one float4 element per thread, vector store
__global__ void fill2(float* a, size_t num, float value) {
    float4* aAlt = reinterpret_cast<float4*>(a);

    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    size_t numFloor = num / 4 * 4;
    size_t c = numFloor / 4;

    if (gtid < c) {
        aAlt[gtid] = make_float4(value, value, value, value);
    }

    // thread 0 writes the trailing elements that don't fill a float4
    if (gtid == 0) {
        for (int i = 0; i < num - numFloor; ++i) {
            a[num - 1 - i] = value;
        }
    }
}

// fill3: grid-stride loop, scalar store ("persistent threads")
__global__ void fill3(float* a, size_t num, float value) {
    int numThreads = blockDim.x * gridDim.x;
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = gtid; i < num; i += numThreads) {
        a[i] = value;
    }
}

// fill4: grid-stride loop over float4 elements, vector store
__global__ void fill4(float* a, size_t num, float value) {
    float4* aAlt = reinterpret_cast<float4*>(a);

    int numThreads = blockDim.x * gridDim.x;
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    size_t numFloor = num / 4 * 4;
    size_t c = numFloor / 4;

    for (int i = gtid; i < c; i += numThreads) {
        aAlt[i] = make_float4(value, value, value, value);
    }

    // thread 0 writes the trailing elements that don't fill a float4
    if (gtid == 0) {
        for (int i = 0; i < num - numFloor; ++i) {
            a[num - 1 - i] = value;
        }
    }
}

int main() {
    cudaSetDevice(0);

    std::size_t num = 200'000'000ULL;
    float* a_d{};
    cudaMalloc(&a_d, sizeof(float) * num);

    int blockSize{};
    int gridSize{};

    float value = 1.0f;
    int loopCount = 5;

    // Method 1: Each thread performs a single set operation.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill1);
        gridSize = (num + blockSize - 1) / blockSize;
        fill1<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 2: On top of method 1, replace scalar store with vector store.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill2);
        size_t numFloor = num / 4 * 4;
        size_t c = numFloor / 4;
        gridSize = (c + blockSize - 1) / blockSize;
        fill2<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 3: Launch the minimum number of threads that saturate the GPU.
    // Each thread performs as many set operations as needed (grid-stride loop).
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill3);
        fill3<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 4: On top of method 3, replace scalar store with vector store.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill4);
        fill4<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 5: Use CUDA driver API cuMemsetD32 for 32-bit set.
    for (int i = 0; i < loopCount; ++i) {
        // replicate the 32-bit pattern of `value` across the buffer
        cuMemsetD32(reinterpret_cast<CUdeviceptr>(a_d), *reinterpret_cast<unsigned int*>(&value),
                    num);
    }

    // Method 6: Use thrust API.
    for (int i = 0; i < loopCount; ++i) {
        thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(a_d);
        thrust::fill(dev_ptr, dev_ptr + num, value);
    }

    cudaDeviceSynchronize();
    cudaFree(a_d);
}
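One small note on Method 5: dereferencing the float through an unsigned int* is formally undefined behavior in C++ (type punning). A memcpy-based bit cast is the well-defined way to obtain the 32-bit word that cuMemsetD32 replicates; a minimal host-only sketch:

```cpp
#include <cstdint>
#include <cstring>

// Return the raw 32-bit pattern of a float without type punning.
// This is the word value one would pass to cuMemsetD32.
std::uint32_t bitPattern(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
    std::uint32_t word = 0;
    std::memcpy(&word, &value, sizeof(word));  // well-defined bit copy
    return word;
}
```

For value = 1.0f this yields 0x3F800000; since C++20, std::bit_cast<unsigned int>(value) expresses the same thing directly.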

Below are the performance results on a 3080 Ti (Ampere, CC 8.6). Understandably, cuMemsetD32 and thrust are among the fastest. But why does fill3, which is supposed to reduce block scheduling overhead by launching the minimal number of threads that saturate the device, perform worse than the vanilla fill1? In particular, fill1 achieves a memory throughput of 97% with 25.27 warp cycles per issued instruction; for fill3 these numbers worsen to 86% and 73.86.

Total Time   Avg
4.607 ms     921.342 µs       fill1
4.677 ms     935.300 µs       fill2
5.135 ms     1.027 ms         fill3
4.704 ms     940.843 µs       fill4
4.660 ms     932.023 µs       [CUDA memset]
4.619 ms     923.857 µs       thrust

The second question is why replacing the scalar store with a vector store (as in fill2) has almost no effect relative to fill1. It looks as if fill1 is bottlenecked by outstanding memory requests alone, which completely offsets the reduction in the number of memory instructions.
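(As an aside, the float4-plus-tail index arithmetic shared by fill2 and fill4 can be sanity-checked on the host. The sketch below simulates which indices the store pattern touches and checks that every element is written exactly once; it is plain CPU code, not part of the benchmark.)

```cpp
#include <cstddef>
#include <vector>

// Simulate fill2's store pattern on the CPU: "thread" gtid < c writes the
// float4 element gtid (scalars 4*gtid .. 4*gtid+3), and thread 0 writes the
// num - numFloor tail elements from the back. Returns true if every index
// in [0, num) is written exactly once.
bool coversAllOnce(std::size_t num) {
    std::vector<int> writes(num, 0);
    std::size_t numFloor = num / 4 * 4;  // largest multiple of 4 <= num
    std::size_t c = numFloor / 4;        // number of float4 elements
    for (std::size_t gtid = 0; gtid < c; ++gtid)
        for (int k = 0; k < 4; ++k)
            ++writes[4 * gtid + k];      // vector store
    for (std::size_t i = 0; i < num - numFloor; ++i)
        ++writes[num - 1 - i];           // tail, handled by thread 0
    for (int w : writes)
        if (w != 1) return false;
    return true;
}
```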

Here are the nsys profile --stats=true ... results I get from running your code on H100 SXM (DGX-H100):

Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      GridXYZ          BlockXYZ                                                     Name                   
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------  --------------  ----------------------------------------------------------------------------------------------------
     25.5        1,737,866          5  347,573.2  347,547.0   347,515   347,740         94.6  195313    1    1  1024    1    1  fill1(float *, unsigned long, float)                                   
     19.6        1,335,086          5  267,017.2  267,229.0   265,692   268,444      1,062.1   264    1    1    1024    1    1  fill3(float *, unsigned long, float)                                   
     19.3        1,318,383          5  263,676.6  265,084.0   250,557   274,397     10,324.2   264    1    1    1024    1    1  fill4(float *, unsigned long, float)                                   
     17.9        1,219,985          5  243,997.0  244,349.0   242,973   244,573        686.0  390625    1    1   256    1    1  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
     17.8        1,216,816          5  243,363.2  243,132.0   242,813   244,573        693.6  48829    1    1   1024    1    1  fill2(float *, unsigned long, float)                                   

[7/8] Executing 'gpumemtimesum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)    Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  -------------
    100.0        1,240,625      5  248,125.0  248,733.0   245,053   251,837      2,640.6  [CUDA memset]

(CUDA 12.0)

I’m not able to reproduce that result. My results show that fill3 has a notably shorter kernel duration, on average over the 5 invocations, than fill1.

I don’t really understand the second question. My results suggest that fill2 is faster than fill1, but I’m not sure if that is what you are asking.

Sorry, I don’t have a 3080Ti to work on.

According to my test data, ranking from fastest to slowest based on the reported average duration:

  1. fill2
  2. thrust
  3. CUDA memset
  4. fill4
  5. fill3
  6. fill1

@Robert_Crovella Thanks a lot for sharing the results on H100. That’s interesting. I wonder why fill4 is still slower than fill2, given that both use vector stores, yet fill4 theoretically incurs less block scheduling overhead and should perform better.

PS: I ran more tests on as many GPUs as I could access (data below). The results look similar to what happened on the 3080 Ti: fill1 appears faster than fill3 in all cases, and fill2 is not significantly faster than fill1 (except on the 1080 Ti).

Perhaps there are subtle architectural changes back and forth between generations that lead to what we’ve seen.

Table: Performance results for 5 different GPUs

1080 Ti, CUDA 12.2, driver 535.54.03
Avg time [ms]
2.043              fill2
2.111              fill4
2.318              fill1
2.411              fill3

2080 SUPER, CUDA 12.0, driver 525.78.01
Avg time [ms]
1.715              fill1
1.734              fill2
2.165              fill4
2.879              fill3

3070 Laptop, CUDA 11.8, driver 520.61.05
Avg time [ms]
2.143              fill1
2.176              fill2
2.185              fill4
2.239              fill3

3080 Ti, CUDA 12.2, driver 535.54.03
Avg time [ms]
0.921342           fill1
0.935300           fill2
0.940843           fill4
1.027              fill3

Undisclosed Turing Quadro card, CUDA 11.2, driver 460.73.01
Avg time [ms]
2.9735             fill2
2.98781            fill1
3.54335            fill4
4.104              fill3

Here is my data running on GTX 1660 Super:

Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     27.2       21,228,907          5  4,245,781.4  4,244,303.0  4,239,087  4,255,087      6,586.4  fill3(float *, unsigned long, float)
     24.1       18,787,746          5  3,757,549.2  3,757,485.0  3,757,134  3,758,157        375.0  fill1(float *, unsigned long, float)
     16.5       12,878,061          5  2,575,612.2  2,562,889.0  2,551,017  2,639,529     36,220.8  fill4(float *, unsigned long, float)
     16.1       12,559,948          5  2,511,989.6  2,513,256.0  2,507,369  2,515,433      3,184.6  fill2(float *, unsigned long, float)
     16.1       12,541,739          5  2,508,347.8  2,507,721.0  2,506,472  2,511,817      2,242.5  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)    Operation
 --------  ---------------  -----  -----------  -----------  ---------  ---------  -----------  -------------
    100.0       12,553,965      5  2,510,793.0  2,509,577.0  2,509,417  2,515,177      2,469.9  [CUDA memset]

Again I see that fill1 and fill3 are slowest. Here is the ranking from fastest to slowest:

  1. thrust
  2. CUDA memset
  3. fill2
  4. fill4
  5. fill1
  6. fill3

Given that the performance of the top 3 is very close to each other (approximately 0.1% variation for GTX 1660 Super, or 2% for H100), I personally see no significant difference between the ranking here and the ranking I observed on H100.

Are you using Nsight Systems for measurement?

Yes, fill1 and fill3 have changed places at the bottom of the ranked list. At the moment I don’t have an explanation for that, but there does appear to be some GPU architecture dependency. Some experiments to try would be to test whether loop unrolling is a factor here, or else to carefully study the SASS code and the achieved occupancy in each case.
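For instance, the grid-stride loop in fill3 could be manually unrolled by a factor of 4 so that each thread issues four independent stores per iteration. Before porting that into the kernel, the index pattern (including the remainder handling) can be checked with a small CPU simulation; this is a sketch of the experiment, not code that was actually benchmarked:

```cpp
#include <cstddef>
#include <vector>

// Simulate a 4x-unrolled grid-stride loop across numThreads "threads" and
// verify that every index in [0, num) is written exactly once.
bool unrolledCoversAllOnce(std::size_t num, std::size_t numThreads) {
    std::vector<int> writes(num, 0);
    for (std::size_t gtid = 0; gtid < numThreads; ++gtid) {
        std::size_t i = gtid;
        // main body: four independent strided stores per iteration
        for (; i + 3 * numThreads < num; i += 4 * numThreads) {
            ++writes[i];
            ++writes[i + numThreads];
            ++writes[i + 2 * numThreads];
            ++writes[i + 3 * numThreads];
        }
        for (; i < num; i += numThreads)
            ++writes[i];  // remainder, same as the original loop
    }
    for (int w : writes)
        if (w != 1) return false;
    return true;
}
```

In the actual kernel the same effect can often be had with #pragma unroll 4 on the loop; the manual version just makes the independent stores explicit.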

I think one solid takeaway should be that “block scheduling overhead” may be an over-emphasized concept in some treatments of CUDA. I would never communicate (when teaching CUDA) the idea that “block scheduling overhead” is a critical factor to consider.


@Robert_Crovella Thanks again.

Yes, I was using Nsight Systems for the time measurements, except for the last entry (“Undisclosed Turing Quadro card”), where the minimal host system has no profiling tools, so I used CUDA events for timing.

I changed the data type from float to unsigned char, and now the fill2/3/4 kernels have similar performance, all being 2.5x faster than fill1 on the 3080 Ti. So whether block-launch overhead constitutes a performance bottleneck is indeed case-dependent, and it is a non-issue for float. I’ll take a closer look at the SASS in this case.
