How many threads are executed at the same time?

Hi,

I have accelerated an image processing application, and I would like to know how many threads are executed at the same time.

I’m currently developing on the Tegra K1 SoC. The TK1 has 1 SMX, which contains 192 CUDA cores.
I’m processing a 1024x1024 image. I decided to create 1024 blocks with 1024 threads per block, so that the number of threads equals the number of pixels:

-> dim3 gridSize(32,32); // 1024 blocks
-> dim3 blockSize(32,32); // 1024 threads per block
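
In code, the launch looks like this (the kernel body below is just a placeholder inversion, not my actual processing):

    __global__ void processImage(const unsigned char* in, unsigned char* out,
                                 int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
        if (x < width && y < height)
            out[y * width + x] = 255 - in[y * width + x];  // placeholder operation
    }

    // Host side: a 32x32 grid of 32x32 blocks = 1024 blocks x 1024 threads/block
    dim3 gridSize(32, 32);
    dim3 blockSize(32, 32);
    processImage<<<gridSize, blockSize>>>(d_in, d_out, 1024, 1024);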

I also assume that we can launch only 1 block at a time because there is only 1 SMX, and that the GPU won’t execute 1024 threads at the same time because it has only 192 cores. Is the GPU executing 192 threads at a time (while the others wait)?

Thanks!

The Tegra K1 has 1 SM with 4 schedulers. Each scheduler can dual-issue each cycle, so the theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread-instructions/cycle.

Due to instruction mix and instruction fetch limits, it is almost impossible to sustain more than 7 instructions/cycle out of the theoretical 8; 5-6 instructions/cycle is already very high.

  • The SM can manage more than 1 block at a time. See occupancy and device limits in the programming guide; a quick way to check is sketched after this list.
  • The number of CUDA cores is the number of FP32 thread instructions that can be processed per cycle, not the number of threads that can be issued. The SM also has special-purpose floating point units, load/store units, FP64 units, branch units, etc.
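
As a quick check, the occupancy API added in CUDA 6.5 reports how many blocks of a given size can be resident on one SM. A minimal sketch, assuming the placeholder processImage kernel from the question above:

    // Assumes the processImage kernel sketched in the question above.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks,    // out: resident blocks per SM for this kernel
            processImage,  // kernel being queried
            1024,          // threads per block
            0);            // dynamic shared memory per block
        printf("Resident blocks per SM: %d\n", numBlocks);
        return 0;
    }

With 1024 threads/block on a device that allows 2048 resident threads/SM, this reports at most 2 blocks, fewer if registers or shared memory are the limiter.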

Each scheduler can dual-issue each cycle. Theoretical instruction issue rate is 4 schedulers x 2 instructions/scheduler x 32 threads/instruction = 256 thread-instructions/cycle.

AFAIK, dual-issue means two instructions from the same thread on all GeForce GPUs starting from CC 2.0, doesn’t it?

CC 2.0 can issue 2 instructions from different warps.
CC 2.1-6.x schedulers can dual-issue 2 consecutive instructions from the same warp. For CC 3.0-6.x the pairing is defined by the compiler.

Yes. So, for example, one SMX can execute 128 instruction pairs per cycle, but the instructions in each pair must come from the same thread. Overall, in a single cycle it executes up to 256 instructions from 128 threads.

Answering the original question: some CPUs have Hyper-Threading, which lets 2 threads execute on the same core, sharing its resources. GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core, if we define “core” in the proper way (4 schedulers x 32 cores/scheduler = 128 cores per SMX). So one SMX can execute up to 2048 threads simultaneously, and therefore two 1024-thread blocks can run concurrently, given sufficient other resources.
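
If you want to verify those limits on your own board, you can query them at run time; maxThreadsPerMultiProcessor and maxThreadsPerBlock are standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // On Tegra K1 (compute capability 3.2) this should print 2048 and 1024.
        printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block:       %d\n", prop.maxThreadsPerBlock);
        return 0;
    }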

Note that the discussion with Greg above was about the number of threads executing in a SINGLE GPU cycle. That number is limited to 128, equal to the number of cores counted my way. But each core stores state for 16 threads simultaneously, so it can execute instructions from other threads in other GPU cycles, allowing all 1024 threads in a thread block to exchange information, synchronize, and so on.
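
To illustrate that last point, here is a hypothetical block-wide reduction (not from the original application): it only works because all 1024 threads of the block stay resident together, even though only a fraction of them issue in any single cycle:

    // Launch as blockSum<<<1024, 1024>>>(d_in, d_out); for a 1024x1024 image.
    __global__ void blockSum(const float* in, float* out)
    {
        __shared__ float buf[1024];           // one slot per thread in the block
        int t = threadIdx.x;
        buf[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();                      // all 1024 resident threads rendezvous
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s)
                buf[t] += buf[t + s];         // threads exchange data via shared memory
            __syncthreads();
        }
        if (t == 0)
            out[blockIdx.x] = buf[0];         // one partial sum per block
    }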

Each SMX has 4 warp schedulers. Each warp scheduler selects a warp and can single-issue or dual-issue for that warp. This means at most 4 warps x 2 instructions/warp x 32 threads/instruction = 256 thread-instructions issued per cycle.

“some CPUs have Hyper-Threading […] GPUs can also run multiple threads per core simultaneously. In particular, recent NVIDIA GPUs can execute up to 16 threads/core […]”

In my opinion this is not the correct way to compare a CPU to a GPU. My recommendation would be to compare

  • 1 SM = 4 CPU cores
  • 1 warp = 1 CPU SMT thread
  • 32 threads/warp = CPU SIMD unit with 32 lanes
  • 1 GPU CUDA core = 16-32 ALU datapaths (or fewer if a SIMD datapath)

Many CPU cores can issue 2-7 instructions per cycle. For example, high-end ARM cores can dual- or triple-issue; an Intel i7 core can issue many more.

The GPU has more SMT threads (warps) in order to hide latency. The CPU has lower memory latency and relies on out-of-order execution to hide latency with fewer threads.