This is the simple vector addition kernel I am using for the test.
I am using a CPU timer to check how long the CPU and the GPU each take to execute just the addition part.
// add vector at host side for result checks
iStart_CPU = cpuSecond();
sumArraysOnHost(h_A, h_B, hostRef, nElem);
iElaps_CPU = cpuSecond() - iStart_CPU;

// add vector at device side
iStart_GPU = cpuSecond();
sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();
iElaps_GPU = cpuSecond() - iStart_GPU;
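For reference, since the kernel itself isn't shown above, here is a minimal sketch of what sumArraysOnGPU, sumArraysOnHost, and cpuSecond typically look like (my assumption of the usual pattern, not the poster's exact code):

#include <time.h>

// Assumed GPU kernel: one thread per element
__global__ void sumArraysOnGPU(float *A, float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < N)
        C[i] = A[i] + B[i];
}

// Assumed CPU reference implementation
void sumArraysOnHost(float *A, float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

// Assumed wall-clock timer helper
double cpuSecond()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
}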
OUTPUT
Using Device 0: NVIDIA GeForce RTX 3050
Vector size 16777216
sumArraysOnCPU Time elapsed 0.017814 sec
**sumArraysOnGPU <<<32768,512>>> Time elapsed 0.001128 sec**
Arrays match.
**GPU is 15.793067 times faster than CPU**
Using Device 0: NVIDIA GeForce RTX 3050
Vector size 16777216
sumArraysOnCPU Time elapsed 0.017940 sec
**sumArraysOnGPU <<<16384,1024>>> Time elapsed 0.001186 sec**
Arrays match.
**GPU is 15.124824 times faster than CPU**
Q1: Why does maximizing the number of threads per block lead to an increase in kernel execution time?
I’m not an expert, but on first pass: 16777216 is evenly divisible by both 512 and 1024, so you wouldn’t expect to see any tail effects.
I think you’re probably just looking at a really small sample size (N = 1). I’d bet the two versions drift quite a bit between individual runs. Try putting the kernel launch inside a loop and timing the whole loop to smooth things out, and maybe do a single launch before the loop to get past the warmup period.
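A rough sketch of that idea, reusing the names from the snippet above (the repetition count of 100 is arbitrary):

// Warm-up launch: absorbs one-time startup costs so they don't land in the timed region
sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();

const int nReps = 100;                              // arbitrary repetition count
double iStart = cpuSecond();
for (int rep = 0; rep < nReps; rep++)
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();                            // wait for all launches to finish
double iElaps = (cpuSecond() - iStart) / nReps;     // average time per launch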
The RTX 3050 GPU supports a maximum of 1536 resident threads per SM. Launching blocks of 512 threads means you have the potential to maximize occupancy. When launching 1024 threads per block, you can only reach a maximum of 2/3 of “full” occupancy.
For people using cc8.6 GPUs (like yours) I generally recommend 512 threads per block (maximum, if that is relevant) for this reason.
Q1: Could you clarify what you mean by “have the potential to maximize occupancy” when using 512 threads per block?
Q2: As I understand it, the RTX 3050 can’t exceed 2/3 occupancy with 1024 threads per block, but why does using 512 threads per block result in faster execution when 512 threads don’t achieve full occupancy either?
Q2: Because 3 blocks of 512 threads means you’re using 3 × 512 = 1536 threads, so every set of three blocks fills an entire SM’s capacity of 1536 threads. With 1024 threads per block you can only run 1 block per SM (a second block would push the total to 2048, above the 1536 limit), so you’re missing out on another 512 threads needed to fully saturate the SM.
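If you want to confirm this on your own GPU rather than working the arithmetic by hand, the CUDA occupancy API can report how many blocks of a given size fit on one SM. A minimal sketch (the kernel definition is my assumption of the usual vector-add pattern):

#include <cstdio>

// Assumed vector-add kernel, as in the code above
__global__ void sumArraysOnGPU(float *A, float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    int maxThreadsPerSM = 0;
    cudaDeviceGetAttribute(&maxThreadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, 0);

    int blockSizes[2] = {512, 1024};
    for (int b = 0; b < 2; b++) {
        int blocksPerSM = 0;
        // How many blocks of this size can be resident on one SM at the same time
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                      sumArraysOnGPU,
                                                      blockSizes[b], 0);
        printf("block size %4d: %d blocks/SM -> %d of %d resident threads\n",
               blockSizes[b], blocksPerSM,
               blocksPerSM * blockSizes[b], maxThreadsPerSM);
    }
    return 0;
}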
Thank you for the answer.
So whenever I use a block size of 512, I can fill one complete SM, with 3 blocks per SM. Am I correct?
I have another point of confusion. I noticed that if I set the block size to 1536, the kernel does not give the correct answer and the profiler cannot profile my code. Is it because a block can hold at most 1024 threads, even though a block exists only in the logical view and not in the hardware view?
Yes, potentially. This assumes there are no other occupancy limiters. Occupancy is a fairly involved topic.
That is illegal in CUDA. I encourage everyone to use proper CUDA error checking. If you do that, you will find that errors are reported. You cannot launch any kernel in CUDA where you are requesting 1536 threads per block. The maximum you can request is 1024, and beyond that you will get an error instead of a valid kernel launch.
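A commonly used error-checking pattern (one sketch of it, not the only way) wraps every runtime API call and checks the launch afterwards:

#include <cstdio>
#include <cstdlib>

// Abort with file/line information if a CUDA runtime call fails
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage around a kernel launch: a 1536-thread block is an invalid configuration,
// so the failure is reported at launch time rather than inside the kernel.
// sumArraysOnGPU<<<grid, 1536>>>(d_A, d_B, d_C, nElem);
// CHECK(cudaGetLastError());        // catches an invalid launch configuration
// CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution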
You may wish to get an orderly introduction to CUDA; I usually recommend this online training series for that.
The cuda-samples repo (below) has a cuda-samples/Samples/1_Utilities/deviceQuery folder if you clone it. If you compile and run it, it lists some extremely useful information about the local GPUs it finds; I use it all the time. The output below is from running it on my box. You can see there that the maximum number of threads per block is 1024, while just above it the maximum number of threads per multiprocessor is listed as 1536. It’s a really good general reference while learning CUDA.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4090"
CUDA Driver Version / Runtime Version 12.2 / 12.0
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24215 MBytes (25390809088 bytes)
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2595 MHz (2.60 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS
There is a general correlation between higher occupancy and higher performance. The basic reason is that the GPU is a latency-hiding machine, and, generally speaking, more resident threads give the machine more opportunity to hide latency. This is covered in more detail in unit 3 of the online training series I mentioned. That does not mean it holds in every case; there may be counter-examples. If there are other possible performance limiters in a code (there usually are), then occupancy is not the only descriptor of achieved performance.
The two most important optimization objectives for any CUDA programmer are to expose enough parallelism and make efficient use of the memory subsystems. Unpacking the first of these begins in unit 3 of that online training series. Unpacking the second of these begins in unit 4 of that online training series.
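As one concrete illustration of exposing enough parallelism (my example, not one taken from the training series): a grid-stride loop lets the same kernel cover an arbitrarily large array while the launch configuration is chosen for occupancy rather than dictated by the problem size.

// Grid-stride variant of the vector add: each thread processes multiple elements,
// so any grid/block configuration covers all N elements
__global__ void sumArraysGridStride(const float *A, const float *B, float *C, int N)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
        C[i] = A[i] + B[i];
}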