Maximizing the number of threads per block leads to longer kernel execution times

Below is the simple vector addition kernel setup I am using for the test. I use a CPU timer to check how long the CPU and the GPU each take to execute the addition part only.

CODE SNIPPETS

#include <sys/time.h>   // needed for gettimeofday()

// wall-clock timer returning seconds
double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
// add vector at host side for result checks
iStart_CPU = cpuSecond();
sumArraysOnHost (h_A, h_B, hostRef, nElem);
iElaps_CPU = cpuSecond() - iStart_CPU;
// add vector at device side for result checks
iStart_GPU = cpuSecond();
sumArraysOnGPU <<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize(); // kernel launch is asynchronous; wait for completion before stopping the timer
iElaps_GPU = cpuSecond() - iStart_GPU;
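
For reference, the bodies of sumArraysOnGPU and sumArraysOnHost aren't shown above; here is a minimal sketch of what they presumably look like (one thread per element on the device, a plain loop on the host):

__global__ void sumArraysOnGPU(float *A, float *B, float *C, int N) {
    // grid-wide thread index; guard against running past the array end
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

void sumArraysOnHost(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
}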

OUTPUT

Using Device 0: NVIDIA GeForce RTX 3050
Vector size 16777216
sumArraysOnCPU Time elapsed 0.017814 sec
**sumArraysOnGPU <<<32768,512>>> Time elapsed 0.001128 sec**
Arrays match.

**GPU is 15.793067 times faster than CPU**

Using Device 0: NVIDIA GeForce RTX 3050
Vector size 16777216
sumArraysOnCPU Time elapsed 0.017940 sec
**sumArraysOnGPU <<<16384,1024>>> Time elapsed 0.001186 sec**
Arrays match.

**GPU is 15.124824 times faster than CPU**
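
(The speedup figures are just the ratio of the CPU and GPU timings, e.g. 0.017814 / 0.001128 ≈ 15.79 for the first run.)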

Q1: Why does maximizing the number of threads per block lead to an increase in kernel execution time?

I’m not an expert, but on first pass: 16777216 is evenly divisible by both 512 and 1024, so you wouldn’t expect to see any tail effects.

I think you’re probably just looking at a really small sample size (N = 1). I’d bet each version drifts quite a bit from one run to the next. Try putting the kernel launch inside a loop and timing the whole loop to smooth that out. Maybe even do a single run before the loop to clear the warm-up period.
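
Something like this, for example (the repetition count here is arbitrary):

// one untimed launch first to absorb warm-up costs
sumArraysOnGPU <<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();

const int nReps = 100;          // arbitrary; pick enough repetitions to smooth out noise
iStart_GPU = cpuSecond();
for (int r = 0; r < nReps; r++)
    sumArraysOnGPU <<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();        // wait for all launches before stopping the timer
iElaps_GPU = (cpuSecond() - iStart_GPU) / nReps;   // average time per launch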

There isn’t much timing difference here anyway (about 5%).

The RTX 3050 GPU has a maximum thread carrying capacity of 1536 threads per SM. Launching blocks of 512 threads means you have the potential to maximize occupancy. When launching 1024 threads per block, you can only reach a maximum of 2/3 of “full” occupancy.

For people using cc8.6 GPUs (like yours) I generally recommend 512 threads per block (maximum, if that is relevant) for this reason.
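
You can verify this with the runtime’s occupancy calculator; a minimal sketch, assuming the sumArraysOnGPU kernel from the post above:

// how many blocks of each size can be resident on one SM
// (fragment; assumes <cstdio> is included and this runs in host code)
int blocks512 = 0, blocks1024 = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks512,  sumArraysOnGPU, 512,  0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks1024, sumArraysOnGPU, 1024, 0);
printf("512/block : %d blocks/SM = %d threads/SM\n", blocks512,  blocks512  * 512);
printf("1024/block: %d blocks/SM = %d threads/SM\n", blocks1024, blocks1024 * 1024);
// on a cc8.6 device (1536 threads/SM) expect 3 and 1 respectively:
// 3 * 512 = 1536 (full), 1 * 1024 = 1024 (2/3 of full)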


Thank you for your reply.

Q1: Could you clarify what you mean by “have the potential to maximize occupancy” when using 512 threads per block?

Q2: Since the RTX 3050 can’t exceed 2/3 occupancy with 1024 threads per block, why does using 512 threads per block result in faster execution times, if that doesn’t achieve full occupancy either?

Hope I’m not imposing by answering:

Q2: Because 3 blocks at 512 threads means you’re using 3 * 512 = 1536 threads, so every three blocks fills an entire SM’s capacity of 1536 threads. With 1024 threads per block you can only run 1 block per SM (a second block would need another 1024 thread slots, but only 512 remain), so you’re missing out on 512 threads needed to fully saturate the SM.


Thank you for the answer.
So whenever I use a block size of 512, I can fully use one SM, with 3 blocks per SM. Am I correct?

I have another confusion here. I saw that if I set the block size to 1536, the kernel does not give the correct answer and the profiler can’t profile my code. Is that because a block can hold at most 1024 threads, even though the block exists only in the logical view and is not present in the hardware view?

Yes, potentially. This assumes there are no other occupancy limiters. Occupancy is a fairly involved topic.

That is illegal in CUDA. I encourage everyone to use proper CUDA error checking. If you do that, you will find that errors are reported. You cannot launch any kernel in CUDA where you are requesting 1536 threads per block. The maximum you can request is 1024, and beyond that you will get an error instead of a valid kernel launch.
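
A typical checking pattern looks something like this (a common idiom, sketched here; not an official API):

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                \
do {                                                               \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                __FILE__, __LINE__, cudaGetErrorString(err));      \
        exit(EXIT_FAILURE);                                        \
    }                                                              \
} while (0)

// kernel launches don't return an error code directly, so check afterwards:
sumArraysOnGPU <<<grid, block>>>(d_A, d_B, d_C, nElem);
CHECK(cudaGetLastError());        // a 1536-thread block is typically reported here as an invalid configuration error
CHECK(cudaDeviceSynchronize());   // catches errors that occur during execution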

You may wish to get an orderly introduction to CUDA; I usually recommend this online training series for that.

The cuda-samples repo (below) has a cuda-samples/Samples/1_Utilities/deviceQuery folder if you clone it. If you compile and run it, it lists some extremely useful information for the local GPUs it finds; I use it all the time. The output below is from running it on my box. You can see there that the maximum number of threads per block is 1024, while just above it says the maximum number of threads per multiprocessor is 1536. It’s a really good general reference while learning CUDA.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 12.0
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24215 MBytes (25390809088 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP:    16384 CUDA Cores
  GPU Max Clock rate:                            2595 MHz (2.60 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS
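
If you only need a couple of these limits inside your own program, you can also query them directly through the runtime API; a minimal sketch:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}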

Thank you for the answer and tutorial link. I will go through the topics serially from that link.

My last question is:

I understood what a block size of 512 is doing under the hood and how three such blocks cover a single SM.

Question: Does the highest occupancy equate to the lowest execution time, or are other factors involved as well?

I ran this code today and got all the information about my GPU. Now I am trying to understand the meaning of every spec. Thank you for sharing, btw.


There is a general correlation between higher occupancy and higher performance. The basic reason for this is that the GPU is a latency-hiding machine, and more resident threads generally speaking gives more opportunity for the machine to hide latency. This is covered in more detail in unit 3 of the online training series I mentioned. That does not mean it holds in every case. There are or may be counter-examples. If there are other possible performance limiters in a code (there usually are) then occupancy is not the only descriptor of achieved performance.

The two most important optimization objectives for any CUDA programmer are to expose enough parallelism and make efficient use of the memory subsystems. Unpacking the first of these begins in unit 3 of that online training series. Unpacking the second of these begins in unit 4 of that online training series.

Thank you very much for the answer. I will go through the chapters.
