Why does the persistent thread approach reduce kernel performance?

Consider the following ways of performing fill on a float array.

  • Method 1: Each thread performs a single set operation.
  • Method 2: On top of method 1, replace scalar store with vector store.
  • Method 3: Launch a minimum number of threads that saturate the GPU. Each thread performs as many set operations as necessary.
  • Method 4: On top of method 3, replace scalar store with vector store.
  • Method 5: Use CUDA driver API cuMemsetD32 for 32-bit set.
  • Method 6: Use thrust API.

Here’s the implementation:

// nvcc Memset.cu -lcuda -o Memset

#include <cstddef>  // std::size_t
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

// fill1: one element per thread, scalar store
__global__ void fill1(float* a, size_t num, float value) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gtid < num) {
        a[gtid] = value;
    }
}

// fill2: one float4 element per thread, vector store
__global__ void fill2(float* a, size_t num, float value) {
    float4* aAlt = reinterpret_cast<float4*>(a);

    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    size_t numFloor = num / 4 * 4;
    size_t c = numFloor / 4;

    if (gtid < c) {
        aAlt[gtid] = make_float4(value, value, value, value);
    }

    // thread 0 writes the trailing elements that don't fill a float4
    if (gtid == 0) {
        for (int i = 0; i < num - numFloor; ++i) {
            a[num - 1 - i] = value;
        }
    }
}

// fill3: grid-stride loop, scalar store ("persistent threads")
__global__ void fill3(float* a, size_t num, float value) {
    int numThreads = blockDim.x * gridDim.x;
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = gtid; i < num; i += numThreads) {
        a[i] = value;
    }
}

// fill4: grid-stride loop over float4 elements, vector store
__global__ void fill4(float* a, size_t num, float value) {
    float4* aAlt = reinterpret_cast<float4*>(a);

    int numThreads = blockDim.x * gridDim.x;
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    size_t numFloor = num / 4 * 4;
    size_t c = numFloor / 4;

    for (int i = gtid; i < c; i += numThreads) {
        aAlt[i] = make_float4(value, value, value, value);
    }

    // thread 0 writes the trailing elements that don't fill a float4
    if (gtid == 0) {
        for (int i = 0; i < num - numFloor; ++i) {
            a[num - 1 - i] = value;
        }
    }
}

int main() {
    cudaSetDevice(0);

    std::size_t num = 200'000'000ULL;
    float* a_d{};
    cudaMalloc(&a_d, sizeof(float) * num);

    int blockSize{};
    int gridSize{};

    float value = 1.0f;
    int loopCount = 5;

    // Method 1: Each thread performs a single set operation.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill1);
        gridSize = (num + blockSize - 1) / blockSize;
        fill1<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 2: On top of method 1, replace scalar store with vector store.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill2);
        size_t numFloor = num / 4 * 4;
        size_t c = numFloor / 4;
        gridSize = (c + blockSize - 1) / blockSize;
        fill2<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 3: Launch the minimum number of threads that saturate the GPU.
    // Each thread performs as many set operations as needed (grid-stride loop).
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill3);
        fill3<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 4: On top of method 3, replace scalar store with vector store.
    for (int i = 0; i < loopCount; ++i) {
        cudaOccupancyMaxPotentialBlockSize(&gridSize, &blockSize, fill4);
        fill4<<<gridSize, blockSize>>>(a_d, num, value);
    }

    // Method 5: Use CUDA driver API cuMemsetD32 for 32-bit set.
    for (int i = 0; i < loopCount; ++i) {
        // replicate the 32-bit pattern of `value` across the buffer
        cuMemsetD32(reinterpret_cast<CUdeviceptr>(a_d), *reinterpret_cast<unsigned int*>(&value),
                    num);
    }

    // Method 6: Use thrust API.
    for (int i = 0; i < loopCount; ++i) {
        thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(a_d);
        thrust::fill(dev_ptr, dev_ptr + num, value);
    }

    cudaDeviceSynchronize();
    cudaFree(a_d);
}
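One small note on Method 5: dereferencing the float through an unsigned int* is formally undefined behavior in C++ (type punning). A memcpy-based bit cast is the well-defined way to obtain the 32-bit word that cuMemsetD32 replicates; a minimal host-only sketch:

```cpp
#include <cstdint>
#include <cstring>

// Return the raw 32-bit pattern of a float without type punning.
// This is the word value one would pass to cuMemsetD32.
std::uint32_t bitPattern(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
    std::uint32_t word = 0;
    std::memcpy(&word, &value, sizeof(word));  // well-defined bit copy
    return word;
}
```

For value = 1.0f this yields 0x3F800000; since C++20, std::bit_cast<unsigned int>(value) expresses the same thing directly.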

Below are the performance results on a 3080 Ti (Ampere, CC 8.6). Understandably, cuMemsetD32 and thrust are among the fastest. But why does fill3, which is supposed to reduce block scheduling overhead by launching the minimal number of threads that saturate the device, perform worse than the vanilla fill1? In particular, fill1 achieves a memory throughput of 97% with 25.27 warp cycles per issued instruction; for fill3 these numbers worsen to 86% and 73.86.

Total Time   Avg
4.607 ms     921.342 µs       fill1
4.677 ms     935.300 µs       fill2
5.135 ms     1.027 ms         fill3
4.704 ms     940.843 µs       fill4
4.660 ms     932.023 µs       [CUDA memset]
4.619 ms     923.857 µs       thrust

The second question is why replacing the scalar store with a vector store (as in fill2) has almost no effect relative to fill1. It looks as if fill1 is bottlenecked by outstanding memory requests alone, which completely offsets the reduction in the number of memory instructions.
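(As an aside, the float4-plus-tail index arithmetic shared by fill2 and fill4 can be sanity-checked on the host. The sketch below simulates which indices the store pattern touches and checks that every element is written exactly once; it is plain CPU code, not part of the benchmark.)

```cpp
#include <cstddef>
#include <vector>

// Simulate fill2's store pattern on the CPU: "thread" gtid < c writes the
// float4 element gtid (scalars 4*gtid .. 4*gtid+3), and thread 0 writes the
// num - numFloor tail elements from the back. Returns true if every index
// in [0, num) is written exactly once.
bool coversAllOnce(std::size_t num) {
    std::vector<int> writes(num, 0);
    std::size_t numFloor = num / 4 * 4;  // largest multiple of 4 <= num
    std::size_t c = numFloor / 4;        // number of float4 elements
    for (std::size_t gtid = 0; gtid < c; ++gtid)
        for (int k = 0; k < 4; ++k)
            ++writes[4 * gtid + k];      // vector store
    for (std::size_t i = 0; i < num - numFloor; ++i)
        ++writes[num - 1 - i];           // tail, handled by thread 0
    for (int w : writes)
        if (w != 1) return false;
    return true;
}
```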

Here are the nsys profile --stats=true ... results I get from running your code on H100 SXM (DGX-H100):

Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      GridXYZ          BlockXYZ                                                     Name                   
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------  --------------  ----------------------------------------------------------------------------------------------------
     25.5        1,737,866          5  347,573.2  347,547.0   347,515   347,740         94.6  195313    1    1  1024    1    1  fill1(float *, unsigned long, float)                                   
     19.6        1,335,086          5  267,017.2  267,229.0   265,692   268,444      1,062.1   264    1    1    1024    1    1  fill3(float *, unsigned long, float)                                   
     19.3        1,318,383          5  263,676.6  265,084.0   250,557   274,397     10,324.2   264    1    1    1024    1    1  fill4(float *, unsigned long, float)                                   
     17.9        1,219,985          5  243,997.0  244,349.0   242,973   244,573        686.0  390625    1    1   256    1    1  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
     17.8        1,216,816          5  243,363.2  243,132.0   242,813   244,573        693.6  48829    1    1   1024    1    1  fill2(float *, unsigned long, float)                                   

[7/8] Executing 'gpumemtimesum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)    Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  -------------
    100.0        1,240,625      5  248,125.0  248,733.0   245,053   251,837      2,640.6  [CUDA memset]

(CUDA 12.0)

I’m not able to reproduce that result. My results show that fill3 has a notably shorter kernel duration, on average over the 5 invocations, than fill1.

I don’t really understand the second question. My results suggest that fill2 is faster than fill1, but I’m not sure if that is what you are asking.

Sorry, I don’t have a 3080Ti to work on.

According to my test data, ranking from fastest to slowest based on the reported average duration:

  1. fill2
  2. thrust
  3. CUDA memset
  4. fill4
  5. fill3
  6. fill1

@Robert_Crovella Thanks a lot for sharing the results on H100. That’s interesting. I wonder why fill4 is still slower than fill2, given that both use vector stores, yet fill4 theoretically incurs less block scheduling overhead and should perform better.

PS: I ran more tests on as many GPUs as I could access (data below). The results look similar to what happened on the 3080 Ti: fill1 appears faster than fill3 in all cases, and fill2 is not significantly faster than fill1 (except on the 1080 Ti).

Perhaps there are subtle architectural changes back and forth between generations that lead to what we’ve seen.

Table: Performance results for 5 different GPUs

1080 Ti, CUDA 12.2, driver 535.54.03
Avg time [ms]
2.043              fill2
2.111              fill4
2.318              fill1
2.411              fill3

2080 SUPER, CUDA 12.0, driver 525.78.01
Avg time [ms]
1.715              fill1
1.734              fill2
2.165              fill4
2.879              fill3

3070 Laptop, CUDA 11.8, driver 520.61.05
Avg time [ms]
2.143              fill1
2.176              fill2
2.185              fill4
2.239              fill3

3080 Ti, CUDA 12.2, driver 535.54.03
Avg time [ms]
0.921342           fill1
0.935300           fill2
0.940843           fill4
1.027              fill3

Undisclosed Turing Quadro card, CUDA 11.2, driver 460.73.01
Avg time [ms]
2.9735             fill2
2.98781            fill1
3.54335            fill4
4.104              fill3

Here is my data running on GTX 1660 Super:

Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     27.2       21,228,907          5  4,245,781.4  4,244,303.0  4,239,087  4,255,087      6,586.4  fill3(float *, unsigned long, float)
     24.1       18,787,746          5  3,757,549.2  3,757,485.0  3,757,134  3,758,157        375.0  fill1(float *, unsigned long, float)
     16.5       12,878,061          5  2,575,612.2  2,562,889.0  2,551,017  2,639,529     36,220.8  fill4(float *, unsigned long, float)
     16.1       12,559,948          5  2,511,989.6  2,513,256.0  2,507,369  2,515,433      3,184.6  fill2(float *, unsigned long, float)
     16.1       12,541,739          5  2,508,347.8  2,507,721.0  2,506,472  2,511,817      2,242.5  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)    Operation
 --------  ---------------  -----  -----------  -----------  ---------  ---------  -----------  -------------
    100.0       12,553,965      5  2,510,793.0  2,509,577.0  2,509,417  2,515,177      2,469.9  [CUDA memset]

Again I see that fill1 and fill3 are slowest. Here is the ranking from fastest to slowest:

  1. thrust
  2. CUDA memset
  3. fill2
  4. fill4
  5. fill1
  6. fill3

Given that the performance of the top 3 is very close to each other (approximately 0.1% variation for GTX 1660 Super, or 2% for H100), I personally see no significant difference between the ranking here and the ranking I observed on H100.

Are you using Nsight Systems for measurement?

Yes, fill1 and fill3 have changed places at the bottom of the ranked list. At the moment I don’t have an explanation for that, but there does appear to be some GPU architecture dependency. Some experiments to try would be to test whether loop unrolling is a factor here, or else to carefully study the SASS code and the achieved occupancy in each case.
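For instance, the grid-stride loop in fill3 could be manually unrolled by a factor of 4 so that each thread issues four independent stores per iteration. Before porting that into the kernel, the index pattern (including the remainder handling) can be checked with a small CPU simulation; this is a sketch of the experiment, not code that was actually benchmarked:

```cpp
#include <cstddef>
#include <vector>

// Simulate a 4x-unrolled grid-stride loop across numThreads "threads" and
// verify that every index in [0, num) is written exactly once.
bool unrolledCoversAllOnce(std::size_t num, std::size_t numThreads) {
    std::vector<int> writes(num, 0);
    for (std::size_t gtid = 0; gtid < numThreads; ++gtid) {
        std::size_t i = gtid;
        // main body: four independent strided stores per iteration
        for (; i + 3 * numThreads < num; i += 4 * numThreads) {
            ++writes[i];
            ++writes[i + numThreads];
            ++writes[i + 2 * numThreads];
            ++writes[i + 3 * numThreads];
        }
        for (; i < num; i += numThreads)
            ++writes[i];  // remainder, same as the original loop
    }
    for (int w : writes)
        if (w != 1) return false;
    return true;
}
```

In the actual kernel the same effect can often be had with #pragma unroll 4 on the loop; the manual version just makes the independent stores explicit.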

I think one solid takeaway should be that “block scheduling overhead” may be an over-emphasized concept in some treatments of CUDA. I would never communicate (when teaching CUDA) the idea that “block scheduling overhead” is a critical factor to consider.


@Robert_Crovella Thanks again.

Yes, I was using Nsight Systems for the time measurements, except for the last entry (“Undisclosed Turing Quadro card”), where the minimal host system has no profiling tools, so I used CUDA events for timing.

I changed the data type from float to unsigned char, and now the fill2/3/4 kernels have similar performance, all being 2.5x faster than fill1 on the 3080 Ti. So whether block-launch overhead constitutes a performance bottleneck is indeed case-dependent, and it is a non-issue for float. I’ll take a closer look at the SASS in this case.
