Confusion about setting kernel block and grid size for maximum occupancy

Hi everyone,
I have a GeForce RTX 2050 GPU (cc 8.6) and I’m testing the execution time of a kernel on this GPU (System: Windows 11, CUDA 12.4). I know that, in order to get the maximum occupancy, you must consider the amount of shared memory, registers, resident blocks/SM, and resident threads/SM for the target GPU and set a proper grid and block size for it. However, the performance I see differs from what I understand from the theory.

These are the specs of my GPU:
Number of SMs: 16
Max number of resident blocks/SM: 16
Max number of resident threads/SM: 1536
I don’t mention the shared memory and register limits because my kernel uses no shared memory and relatively few registers (the code is at the end).

I have two questions:

  1. I thought the optimal launch parameters for my kernel were <<<32, 768>>> (because this assigns 2 blocks to each SM, uses all SMs, and reaches the maximum number of resident threads). However, when I run my kernel with <<<16, 1024>>>, it takes less time to execute. I can’t explain this behavior. What is the reason?

  2. In order to use the full capacity of all SMs, I can use these sizes: <<<32, 768>>>, <<<64, 384>>>, …, <<<256, 96>>> (each of them launches 16 × 1536 = 24576 threads in total). What is the difference between these in theory? When I launch my kernel with these parameters, I see performance degradation as the block size decreases and the number of blocks increases. Why is that?

This is the code snippet (I read the filter coefficients and input signal from file; I omitted that part of the code):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <vector>

#include "cuComplex.h"

__constant__ float d_basebandFilterCoeffs[1024];

// FIR baseband filter: output[i] is the convolution of the coefficients in
// constant memory with the filterLen most recent input samples. The input
// buffer is prefixed with filterLen - 1 zero-padded samples (see main).
__global__ void baseBandFilter(cuComplex* output, cuComplex* input,
    const int filterLen, const int inputLen)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    cuComplex sum;
    // grid-stride loop: the whole grid covers all inputLen output samples
    while (i < inputLen)
    {
        sum = {};
        for (int j{}; j < filterLen; ++j)
        {
            sum.x += d_basebandFilterCoeffs[j] * input[i - j + filterLen - 1].x;
            sum.y += d_basebandFilterCoeffs[j] * input[i - j + filterLen - 1].y;
        }

        output[i] = sum;
        i += (blockDim.x * gridDim.x);
    }
}

#define gpuErrchk() { gpuAssert(__FILE__, __LINE__); }
inline void gpuAssert(const char* file, int line, bool abort = true)
{
    cudaDeviceSynchronize();
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main()
{
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    const int inputLen = 2 * 1024 * 1024;
    int filterLen = 513;
    std::vector<float> filterCoeffs(filterLen);

    std::vector<cuComplex> input(inputLen);
    std::vector<cuComplex> output(inputLen);

    cuComplex* d_input;
    cuComplex* d_output;
    cudaMalloc(&d_input, (inputLen + (filterLen - 1)) * sizeof(cuComplex));
    cudaMalloc(&d_output, inputLen * sizeof(cuComplex));
    gpuErrchk();

    cudaMemcpy(d_input + filterLen - 1, input.data(), inputLen * sizeof(cuComplex), cudaMemcpyHostToDevice);
    cudaMemset(d_input, 0, (filterLen - 1) * sizeof(cuComplex));   // zero-pad the leading samples
    cudaMemcpyToSymbol(d_basebandFilterCoeffs, filterCoeffs.data(), filterLen * sizeof(float));
    gpuErrchk();

    cudaEvent_t _start, _stop;
    cudaEventCreate(&_start);
    cudaEventCreate(&_stop);
    cudaEventRecord(_start, 0);

    //change block and grid size for different tests
    baseBandFilter<<<16, 1024>>>(d_output, d_input, filterLen, inputLen);

    cudaEventRecord(_stop, 0);
    cudaEventSynchronize(_stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, _start, _stop);
    std::cout << "Elapsed time: " << elapsedTime << std::endl;

    gpuErrchk();

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

GeForce RTX 2050 is not a cc8.6 GPU. The rest of your question depends strongly on clarity here (what GPU you actually have) so I suggest clarifying that first. Run deviceQuery, and copy the text output to a new message in this thread.

(An RTX 2050 is cc7.5, has 14 SMs, each of which can hold a maximum of 1024 threads, and in that setting your observation 1 would be quite plausible. With the grid-stride loop kernel design that you have, I would expect the choice (16, 1024), which would consist of 2 waves on 14 SMs, to run faster than (32, 768), which would consist of 3 waves on 14 SMs. (14, 1024) might be even faster.)
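
To make the wave arithmetic explicit, here is a rough sketch of the estimate I’m using (it only considers the threads-per-SM limit and ignores registers, shared memory, and the blocks-per-SM cap):

#include <cstdio>

// Rough "full waves" estimate: how many rounds of blocks the GPU needs to run
// the whole grid, if each SM holds as many whole blocks as its thread limit allows.
int wavesEstimate(int gridBlocks, int blockSize, int numSMs, int maxThreadsPerSM)
{
    int blocksPerSMPerWave = maxThreadsPerSM / blockSize;      // e.g. 1024 / 1024 = 1
    int blocksPerWave = numSMs * blocksPerSMPerWave;
    return (gridBlocks + blocksPerWave - 1) / blocksPerWave;   // ceiling division
}

int main()
{
    // cc7.5 assumption: 14 SMs, 1024 threads/SM
    printf("(16, 1024): %d waves\n", wavesEstimate(16, 1024, 14, 1024));  // 2 waves
    printf("(32,  768): %d waves\n", wavesEstimate(32,  768, 14, 1024));  // 3 waves
    return 0;
}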

Thank you for your reply. I checked my GPU model and it really is RTX 2050. This is the output of deviceQuery that I ran on my laptop:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2050"
  CUDA Driver Version / Runtime Version          12.4 / 12.4
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 4096 MBytes (4294443008 bytes)
MapSMtoCores for SM 8.6 is undefined.  Default to use 192 Cores/SM
MapSMtoCores for SM 8.6 is undefined.  Default to use 192 Cores/SM
  (016) Multiprocessors, (192) CUDA Cores/MP:    3072 CUDA Cores
  GPU Max Clock rate:                            1455 MHz (1.46 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.4, NumDevs = 1
Result = PASS

Interesting. TIL

I guess the RTX 2050 is a laptop-only GPU, and it is built around an Ampere-class chip (GA107). So it is cc8.6.

(if you want to get rid of the “is undefined” message, download and compile and run the latest version of that sample code).

And it has 16 SMs. So maximum performance from an occupancy perspective should probably be something like (32, 768) or (48, 512)
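
If you want to double-check that programmatically rather than by hand, the runtime occupancy API can report it for your actual kernel (it accounts for the kernel’s register and shared memory usage). A minimal sketch, assuming baseBandFilter is visible in the same translation unit, e.g. dropped into your main():

// Occupancy check for baseBandFilter at several block sizes.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

const int candidates[] = { 1024, 768, 512, 384, 256, 128 };
for (int c = 0; c < 6; ++c)
{
    int blockSize = candidates[c];
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, baseBandFilter, blockSize, 0);

    printf("blockSize %4d: %2d blocks/SM, %4d threads/SM, full-occupancy grid = %d blocks\n",
        blockSize, blocksPerSM, blocksPerSM * blockSize,
        blocksPerSM * prop.multiProcessorCount);
}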

I don’t have that GPU available to me; the closest I have is a cc8.9 GPU (various specs important for this discussion are similar to cc8.6, e.g. 1536 threads/SM). My cc8.9 GPU (L4) happens to have 58 SMs, and when I run your code, I observe the fastest performance at either (58*3, 512) or (58*2, 768), both at about 0.9 ms as reported by the code.

So I am not able to explain your observation.

Are you building a debug project in Visual Studio? If so, switch to building a release project.

I do not have access to that particular GPU, but on the two GPUs I tried, the two configurations <<<32,768>>> and <<<16,1024>>> performed identically (within a general measurement noise level of 2%) on one GPU, while the first configuration was 15% faster on the other (different model) GPU. So I am not able to confirm the observation. I am running CUDA 12.3.

Running two or more thread blocks per SM will generally provide performance that is no worse and often better than running with a single thread block of maximum size.

As Robert_Crovella noted, all benchmarking (whether involving CPUs or GPUs) must be based on release builds. Benchmarking of debug builds is meaningless.

Yes, I was surprised at first when I saw that the RTX 2050 had cc8.6, and the link you provided pointed out interesting things.

I am running the project in release mode. I also changed the arch and code parameters in the VS project settings (sm_86, compute_86), but that didn’t make any difference either. Just one more thing: I found this answer to a question on a similar topic on StackOverflow. In the second part of the answer, it says:

Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don’t know why.

Is this information accurate? i.e. could something like this be happening in my case?

Thank you

Thank you. What about when the number of blocks becomes much larger and the block size correspondingly smaller (so each SM gets more blocks, but each block has fewer threads)? Something like the <<<256, 96>>> that I mentioned in the question. Should that have a longer execution time?

Very generally speaking, finer granularity of execution, so smaller thread blocks, is often conducive to higher performance, especially when one considers performance across multiple GPU architectures. That does not mean that using smaller thread blocks automagically improves performance, just that on average this tends to allow best utilization of GPU resources and thus best performance.

Note that thread block size may flow naturally from application specifics, e.g. 256 threads per block to handle 16x16 image tiles that divide an image evenly, and therefore choosing a thread-block size that is ill-fitted to the problem at hand may increase overhead costs, for example, due to the handling of end cases.

The lower useful limit of thread block size is often 64 threads, although I have seen applications where thread blocks of just 32 threads turned out to be optimal. As a starting point for new kernel designs I recommend a thread block size between 128 and 256 threads that is a multiple of 32, deviating from that only if there are good reasons to do so.
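
For a grid-stride kernel like the one in this thread it is also cheap to simply sweep candidate block sizes and measure. A rough sketch, reusing d_output, d_input, filterLen, and inputLen from the posted main() and the 16-SM / 1536-threads-per-SM figures from the deviceQuery output:

// Block-size sweep, with the grid sized so each SM can be filled to its
// 1536-thread limit. Timings are averaged over many launches after a warm-up.
const int candidateBlockSizes[] = { 96, 128, 192, 256, 384, 512, 768, 1024 };
const int numSMs = 16;          // per the deviceQuery output
const int iterations = 200;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int c = 0; c < 8; ++c)
{
    int blockSize = candidateBlockSizes[c];
    int gridSize = numSMs * (1536 / blockSize);

    // warm-up launch, excluded from the timed loop
    baseBandFilter<<<gridSize, blockSize>>>(d_output, d_input, filterLen, inputLen);

    cudaEventRecord(start);
    for (int it = 0; it < iterations; ++it)
        baseBandFilter<<<gridSize, blockSize>>>(d_output, d_input, filterLen, inputLen);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("<<<%3d, %4d>>>: %.3f ms per launch\n", gridSize, blockSize, ms / iterations);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);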


I’m a bit skeptical, but it requires context.

Is it theoretically possible that the block scheduler could put two blocks on one SM while another gets none?

  1. Yes, because the behavior of the block scheduler is not formally specified. It is theoretically possible to schedule all blocks on a single SM, while leaving other SMs idle.

  2. Is any of that sensible? Not really. Why would the GPU designers create a design that is clearly non-optimal (leaving resources idle)? It seems preposterous to me. A much more reasonable supposition is to posit an even distribution, whatever that means, subject to various considerations.

  3. I can imagine a case where 2 blocks get scheduled on the same SM while another SM remains idle. Suppose we have a kernel with very short-running blocks. The block scheduler schedules a block on SM 0, then goes on to schedule on SM 1, then on SM 2, and so on. Suppose there are 16 SMs. By the time the scheduler gets to SM 15, the block on SM 0 has already finished. So the block scheduler, with no loss of performance or sensibility, could schedule the 16th block on either SM 15 or SM 0, without making a preposterous or foolish decision. Any sort of telemetry could then report that two blocks were scheduled on SM 0 while none were scheduled on SM 15.

You can find out yourself using a methodology similar to what is depicted here. Obviously I cannot give a definitive answer. It is possible.
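
The gist of that methodology, for reference: each block can read the %smid special register via inline PTX and record which SM it ran on. A minimal sketch; blockToSM is a hypothetical device array with one entry per block:

// Read the SM id of the current block via the %smid special register.
__device__ __forceinline__ unsigned int smId()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records the SM it landed on.
// Launch with the same grid/block configuration you want to inspect.
__global__ void recordBlockPlacement(unsigned int* blockToSM)
{
    if (threadIdx.x == 0)
        blockToSM[blockIdx.x] = smId();
}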

In all of this, you’ve not actually stated any of your performance measurements, other than in relative terms. If we are chasing small differences here, the entire discussion may be moot or off-base.

If you’re anxious to discover the cause of differences (hopefully not small ones - those can be quite difficult to pinpoint), the profiler is the best tool I know of. With Nsight Compute you can literally compare “side by side” the behavior of your code with 2 different kernel invocations, across a broad range of metrics and measurements.

I’ve also run into cases where people have said “I’m building a release project” only to discover they made a mistake, don’t know what that means, or actually had an extra -G specified manually in the command line. If you’re at all uncertain, it might not hurt to inspect the compiler commands issued in the VS console for your build.

WDDM can also occasionally throw a wrench into GPU behavior and measurements, although I don’t have anything specific to suggest. Using a GPU for a display at the same time as doing careful benchmarking is not what I would recommend. And to skip one round of back-and-forth, by definition this indication:

  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)

means your GPU is being shared as a display element from the HW and OS perspective, even if you think otherwise.

Right, sorry, I didn’t give any numbers. I tried three grid/block sizes and this is the result:

With <<<16, 1024>>>: 4.53 ms (average of 200 runs), 4.6 ms (single run)
With <<<32, 768>>>: 4.58 ms (average of 200 runs), 4.7 ms (single run)
With <<<48, 512>>>: 4.80 ms (average of 200 runs), 4.9 ms (single run)

I ran the kernel in a for loop with 200 iterations and averaged the execution time to get a more accurate figure. The difference between the first and second configurations is small, but I expected the second configuration to be at least as fast in every case (I mean, I still don’t have an explanation for this).

So I’ll try your suggestions (Nsight Compute, and the method for seeing how blocks are assigned to SMs), or it could all just be normal behavior due to the effect of WDDM. Thank you for all the useful information.

The difference of the average times here is about 1%. A useful practical rule of thumb in performance optimization contexts is that differences < 2% are “in the noise” and therefore not actionable. Instead of considering the average time (which can be unduly influenced by outliers), considering the minimum or median from among 10 or more measurements is often more meaningful.
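
A minimal sketch of that kind of measurement, reusing the kernel and buffers from the posted code (the sample count of 30 is arbitrary):

// Time each launch individually so the minimum and median can be reported
// instead of the mean. Requires <algorithm> and <vector> at file scope.
std::vector<float> samples;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int it = 0; it < 30; ++it)
{
    cudaEventRecord(start);
    baseBandFilter<<<16, 1024>>>(d_output, d_input, filterLen, inputLen);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    samples.push_back(ms);
}

std::sort(samples.begin(), samples.end());
printf("min %.3f ms, median %.3f ms, max %.3f ms\n",
       samples.front(), samples[samples.size() / 2], samples.back());

cudaEventDestroy(start);
cudaEventDestroy(stop);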

Right, thank you so much