How many threads are executed at the same time?

Hi,

I have accelerated an image processing application, and I would like to know how many threads are executed at the same time.

I’m currently developing on the Tegra K1 SoC. The TK1 has 1 SMX, which contains 192 CUDA cores.
I’m processing a 1024x1024 image. I decided to create 1024 blocks with 1024 threads per block, so that the number of threads equals the number of pixels:

-> dim3 gridSize(32,32); // 1024 blocks
-> dim3 blockSize(32,32); // 1024 threads per block
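
In code, the launch looks like this (the kernel body below is just a placeholder inversion, not my actual processing):

    __global__ void processImage(const unsigned char* in, unsigned char* out,
                                 int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
        if (x < width && y < height)
            out[y * width + x] = 255 - in[y * width + x];  // placeholder operation
    }

    // Host side: a 32x32 grid of 32x32 blocks = 1024 blocks x 1024 threads/block
    dim3 gridSize(32, 32);
    dim3 blockSize(32, 32);
    processImage<<<gridSize, blockSize>>>(d_in, d_out, 1024, 1024);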

I also assume that we can launch only 1 block at a time because there is only 1 SMX, and that the GPU won’t execute 1024 threads at the same time because it has only 192 cores. Is the GPU executing 192 threads at a time (while the others wait)?

Thanks!

The Tegra K1 has 1 SM with 4 schedulers. Each scheduler can dual-issue each cycle, so the theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread-instructions/cycle.

Due to instruction mix and instruction fetch limits, it is almost impossible to sustain more than 7 instructions/cycle out of the theoretical 8; 5-6 instructions/cycle is already very high.

  • The SM can manage more than 1 block at a time. See occupancy and device limits in the programming guide; a quick way to check is sketched after this list.
  • The number of CUDA cores is the number of FP32 thread instructions that can be processed per cycle, not the number of threads that can be issued. The SM also has special-purpose floating point units, load/store units, FP64 units, branch units, etc.
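
As a quick check, the occupancy API added in CUDA 6.5 reports how many blocks of a given size can be resident on one SM. A minimal sketch, assuming the placeholder processImage kernel from the question above:

    // Assumes the processImage kernel sketched in the question above.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks,    // out: resident blocks per SM for this kernel
            processImage,  // kernel being queried
            1024,          // threads per block
            0);            // dynamic shared memory per block
        printf("Resident blocks per SM: %d\n", numBlocks);
        return 0;
    }

With 1024 threads/block on a device that allows 2048 resident threads/SM, this reports at most 2 blocks, fewer if registers or shared memory are the limiter.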

Each scheduler can dual-issue each cycle. Theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread-instructions/cycle.

AFAIK, dual-issue means two instructions from the same thread on all GeForce GPUs starting from CC 2.0, doesn’t it?

CC 2.0 can issue 2 instructions from different warps.
CC 2.1-6.x schedulers can dual-issue 2 consecutive instructions from the same warp. For CC 3.0-6.x the pairing is defined by the compiler.

Yes. So, for example, one SMX can execute 128 instruction pairs per cycle, but the instructions in each pair must come from the same thread. Overall, in a single cycle it executes up to 256 instructions from 128 threads.

Answering the original question: some CPUs have Hyper-Threading, which lets 2 threads execute on the same core, sharing its resources. GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core, if we define “core” in the proper way (4 schedulers x 32 cores/scheduler = 128 cores per SMX). So one SMX can execute up to 2048 threads simultaneously, and therefore two 1024-thread blocks can run concurrently, given sufficient other resources.
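
If you want to verify those limits on your own board, you can query them at run time; maxThreadsPerMultiProcessor and maxThreadsPerBlock are standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // On Tegra K1 (compute capability 3.2) this should print 2048 and 1024.
        printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block:       %d\n", prop.maxThreadsPerBlock);
        return 0;
    }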

Note that the discussion with Greg above was about the number of threads executing in a SINGLE GPU cycle. That number is limited to 128, equal to the number of cores counted my way. But each core stores state for 16 threads simultaneously, so it can execute instructions from other threads in other GPU cycles, allowing all 1024 threads in a thread block to exchange information, synchronize, and so on.
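
To illustrate that last point, here is a hypothetical block-wide reduction (not from the original application): it only works because all 1024 threads of the block stay resident together, even though only a fraction of them issue in any single cycle:

    // Launch as blockSum<<<1024, 1024>>>(d_in, d_out); for a 1024x1024 image.
    __global__ void blockSum(const float* in, float* out)
    {
        __shared__ float buf[1024];           // one slot per thread in the block
        int t = threadIdx.x;
        buf[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();                      // all 1024 resident threads rendezvous
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s)
                buf[t] += buf[t + s];         // threads exchange data via shared memory
            __syncthreads();
        }
        if (t == 0)
            out[blockIdx.x] = buf[0];         // one partial sum per block
    }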

Each SMX has 4 warp schedulers. Each warp scheduler selects a warp and can single-issue or dual-issue for that warp. This means at most 4 warps x 2 instructions/warp x 32 threads/instruction = 256 thread-instructions issued per cycle.

“some CPUs have Hyper-Threading […] GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core […]”

In my opinion this is not the correct way to compare a CPU to a GPU. My recommendation would be to compare

  • 1 SM = 4 CPU cores
  • 1 warp = 1 CPU SMT thread
  • 32 threads/warp = CPU SIMD unit with 32 lanes
  • 1 GPU CUDA core = 16-32 ALU datapaths (or fewer if a SIMD datapath)

Many CPU cores can issue 2-7 instructions per cycle. For example, high-end ARM cores can dual- or triple-issue; an Intel i7 core can issue many more.

The GPU has more SMT threads (warps) in order to hide latency. The CPU has lower memory latency and relies on out-of-order execution to hide latency with fewer threads.