How many threads are executed at the same time?

Hi,

I accelerated an image processing application but I would like to know how many threads are executed at the same time.

I’m currently developing on the Tegra K1 SoC. The TK1 has 1 SMX, which contains 192 CUDA cores.
I’m processing an image of 1024x1024 pixels. I decided to create 1024 blocks with 1024 threads per block so that the number of threads equals the number of pixels:

→ gridSize(32,32); //1024 blocks
→ blockSize(32,32); //1024 threads
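
For context, a minimal sketch of this kind of launch configuration (the kernel body and the names below are illustrative placeholders, not my actual code):

#include <cuda_runtime.h>

// Minimal sketch: one thread per pixel of a 1024x1024 8-bit image.
// "process" and "d_img" are placeholder names for illustration only.
__global__ void process(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];   // example per-pixel operation
}

int main()
{
    const int width = 1024, height = 1024;
    unsigned char *d_img;
    cudaMalloc(&d_img, width * height);

    dim3 gridSize(32, 32);    // 32x32 = 1024 blocks
    dim3 blockSize(32, 32);   // 32x32 = 1024 threads per block
    process<<<gridSize, blockSize>>>(d_img, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}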

I also know that we can launch only 1 block at a time because there is only 1 SMX. The GPU won’t execute 1024 threads at the same time because it has only 192 cores. Is the GPU executing 192 threads at the same time (while the others are waiting)?

Thanks!

Tegra K1 has 1 SM with 4 schedulers. Each scheduler can dual-issue each cycle. The theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread instructions/cycle.

Due to the instruction mix and instruction fetch limits, it is almost impossible to sustain more than 7 instructions/cycle; 5-6 instructions/cycle is already very high.

  • The SM can manage more than 1 block at a time. See occupancy and device limits in the programming guide (and the sketch after this list).
  • The number of CUDA cores is the number of fp32 thread instructions that can be processed per cycle, not the number of threads that can be issued. The SM also has special-purpose floating point units, load/store units, fp64 units, branch units, etc.
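
As a quick check of the first point, the occupancy API reports how many blocks of a given size can be resident on one SM at the same time. A minimal sketch, assuming a 1024-thread placeholder kernel standing in for the poster’s image kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the poster's image kernel.
__global__ void process(unsigned char *img, int width, int height) {}

int main()
{
    int blocksPerSM = 0;
    // Maximum number of 1024-thread blocks that can be resident on one SM at
    // the same time, given this kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, process, 1024, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}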

Each scheduler can dual-issue each cycle. The theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread instructions/cycle.

AFAIK, dual-issue means two instructions from the same thread on all GeForces starting from CC 2.0, doesn’t it?

CC2.0 can issue 2 instructions from different warps.
CC2.1-6.x schedulers can dual issue 2 consecutive instructions from the same warp. For CC3.0-6.x the pairing is defined by the compiler.

Yes. So, for example, one SMX can execute 128 instruction pairs per cycle, but the two instructions in each pair must come from the same thread. Overall, in a single cycle it executes up to 256 instructions from 128 threads.

Answering the original question: some CPUs have HyperThreading, which allows them to execute 2 threads on the same core, sharing its resources between those threads. GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core, if we define “core” in the proper way (4 schedulers x 32 cores/scheduler = 128 cores per SMX). So one SMX can execute up to 2048 threads simultaneously, and therefore two 1024-thread blocks can run concurrently, given sufficient other resources.

Note that the discussion with Greg above was about the number of threads executing in a SINGLE GPU cycle. That is limited to 128, equal to the number of cores computed my way. But each core keeps state for 16 threads simultaneously, so it can execute instructions from other threads in other GPU cycles, allowing all 1024 threads in a thread block to exchange information, synchronize, and so on.
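
If you would rather read those limits from the runtime than hard-code 2048, a minimal sketch using standard cudaDeviceProp fields (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Threads that can be *resident* on one SM (2048 on Kepler/Maxwell),
    // as opposed to the number issued or executed in a single cycle.
    printf("Resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("1024-thread blocks per SM (thread limit only): %d\n",
           prop.maxThreadsPerMultiProcessor / 1024);
    return 0;
}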

Each SMX has 4 warp schedulers. Each warp scheduler selects a warp and can single-issue or dual-issue for the warp. This would mean 4 warps x 2 instructions x 32 = 256 thread instructions issued per cycle.

Answering the original question: some CPUs have HyperThreading, which allows them to execute 2 threads on the same core, sharing its resources between those threads. GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core, if we define “core” in the proper way (4 schedulers x 32 cores/scheduler = 128 cores per SMX). So one SMX can execute up to 2048 threads simultaneously, and therefore two 1024-thread blocks can run concurrently, given sufficient other resources.

In my opinion, this is not the correct way to compare a CPU to a GPU. My recommendation would be to compare:

  • 1 SM = 4 CPU cores
  • 1 warp = 1 CPU SMT thread
  • 32 threads/warp = CPU SIMD unit with 32 lanes
  • 1 GPU CUDA core = 16-32 ALU datapaths (or less if SIMD datapath)

Many CPU cores can issue 2-7 instructions per cycle. For example, high-end ARM cores can dual- or triple-issue; an i7 core can issue many more.

The GPU has more SMT threads (warps) in order to hide latency. A CPU has lower memory latency and relies on out-of-order execution to hide latency with fewer threads.

Hello to all.
I have a GeForce GTX 960 (Maxwell 2.0, CC 5.2)
and I would like to know if my calculations for this card are correct:
1 GPU = 2 GPCs
1 GPC = 4 SMMs
=> 2 GPCs * 4 SMMs = 8 SMMs
1 SMM = 4 * (8 * 4) = 128 cores
=> 8 SMMs * 128 cores = 1024 total cores

64 warps/SMM * 8 SMMs = 512 total warps
=> 512 total warps * 32 threads/warp = 16384 total threads
or 2048 threads/SMM * 8 SMMs = 16384 total threads
32 blocks/SMM * 8 SMMs = 256 total blocks

From all the above:
16384 total threads / 1024 cores = each core could execute 16 threads

I suppose the GPU can execute 16384 threads in parallel (maybe simultaneously?).
Is that correct?

Yes, a GTX 960 has 8 SMs, each of which can have a maximum of 2048 threads resident on it at any particular time. Those threads can all have instructions in flight on the various execution units of the SM.

A core is a pipelined execution unit that executes FP32 add, multiply, or multiply-add. It can accept one new instruction every clock. The core may have multiple instructions in flight, up to the number of pipeline stages.
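
If you want to confirm those resident-thread totals on the card itself rather than from the spec sheet, here is a minimal sketch using standard cudaDeviceProp fields (the threads-per-core division still needs the 1024-core figure from the datasheet, since cores per SM is not exposed as a device property):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int residentWarps   = residentThreads / prop.warpSize;

    printf("SMs:              %d\n", prop.multiProcessorCount);  // 8 on a GTX 960
    printf("Resident warps:   %d\n", residentWarps);             // 512
    printf("Resident threads: %d\n", residentThreads);           // 16384
    return 0;
}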

Hi! Sorry to revive an old topic, but it’s similar to mine: I’d like to know how many threads can theoretically be executed simultaneously on a GeForce GTX 1650 mobile. The spec sheet says it has 1024 cores, but I see that the actual number of simultaneously running threads is derived from the number of SMs and the number of warps per SM.

The GTX 1660 is a TU116 with 22 SMs. Each SM has 4 sub-partitions. Each sub-partition can issue 1 warp instruction per cycle. TU116 has a maximum of 32 warps/SM.

Thread instructions issued/cycle = 22 SMs x 4 sub-partitions/SM x 1 warp instruction/sub-partition x 32 threads/warp = 2816

Resident threads = 22 SMs x 32 warps/SM x 32 threads/warp = 22,528