thread, warp, block, grid, device

Hello everyone,
I have read a lot about this, but it's not fully clear to me.

I have a Jetson TK1 with 1 Streaming Multiprocessor (SM) of 192 CUDA cores, also called Stream Processors (SPs).

What I have read:
Threads in a block are grouped into warps of 32 threads, and warps are executed in parallel.
Warps from different blocks can be executed on one SM.

Can threads from different blocks be in the same warp?

How many threads are executed on one SP? Intuitively I would say 1.
If so, then 192/32 = 6 warps could be executed in parallel on the TK1 at most.

I know that threads are grouped into blocks and blocks into grids.

Is a grid all blocks of one kernel call?

And is a device a single GPU? Does each device have its own global memory?

Thanks in advance!

No.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
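To see the consequence of that in code, here is a minimal sketch of my own (the kernel and the warpId/laneId names are made up for illustration): warps are carved out of the threads of a single block, 32 consecutive threads at a time, so a warp never spans two blocks.

#include <cstdio>

__global__ void showWarpLayout()
{
    // Warps are formed from consecutive threads of the SAME block:
    // threads 0-31 of a block are warp 0, threads 32-63 are warp 1, etc.
    int warpId = threadIdx.x / 32;   // warp index within this block
    int laneId = threadIdx.x % 32;   // position within the warp
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main()
{
    showWarpLayout<<<2, 64>>>();   // 2 blocks of 64 threads = 2 warps per block
    cudaDeviceSynchronize();
    return 0;
}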

An SP is a functional unit that can receive (i.e. begin processing) one floating-point instruction (FP add, FP multiply, or FP multiply-add) on each clock cycle. Scheduling such an instruction across a warp requires 32 SP "cores". Other types of instructions get issued to other types of functional units on the device.
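As a concrete example (a hypothetical kernel of my own, not from the programming guide): the a*x + y statement below typically compiles to a single FP32 multiply-add (FFMA) instruction, which is exactly the kind of instruction those SP units process. You can inspect the generated machine code with cuobjdump --dump-sass on the compiled binary.

#include <cstdio>

// Hypothetical kernel for illustration: a*x[i] + y[i] is a
// single-precision multiply-add; issued warp-wide, one such
// instruction occupies 32 SP "cores".
__global__ void saxpy(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];   // FP32 multiply-add
}

int main()
{
    const int n = 1024;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);   // expect 4.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}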

This is likely going down an incorrect thought path. A Kepler SM (like the one in the TK1) can indeed issue a maximum of 6 warps' worth of FP add, multiply, or multiply-add instructions (i.e. 6 instructions) in a single cycle (in practice we rarely observe this). However, other instructions get executed on other types of functional units in the SM, so this does not provide the full picture. Furthermore, it would not be correct to say that we want to write programs that consist of 6 warps in order to maximally utilize the GPU.
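To put a number on that: the relevant measure for keeping the SM busy is resident warps, not issued warps. Here is a small sketch of my own using the occupancy API from the CUDA runtime (the dummy kernel and block size are placeholders I chose):

#include <cstdio>

__global__ void dummy() { }   // placeholder kernel for the query

int main()
{
    int numBlocks = 0, blockSize = 256;
    // Ask the runtime how many blocks of 'dummy' at this block size
    // can be resident on one SM at the same time.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy, blockSize, 0);
    // A Kepler SM can hold up to 64 resident warps (2048 threads), far
    // more than the 6 warps' worth of FP32 instructions it can issue
    // per cycle; the extra resident warps are what hide latency.
    printf("blocks/SM: %d -> warps/SM: %d\n", numBlocks, numBlocks * blockSize / 32);
    return 0;
}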

My use of FP above refers to single-precision (i.e. FP32) arithmetic. Double-precision instructions use different functional units and have different throughput ratios.

To get an idea of what other arithmetic functional units exist on a typical SM, and their typical throughput ratios (which roughly indicates how many of each type of functional unit there are in an SM), refer to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Yes.
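To make that concrete, here is a small sketch of my own: the grid is exactly the set of blocks specified in the launch configuration of one kernel call, visible inside the kernel as gridDim.

#include <cstdio>

__global__ void showGrid()
{
    // gridDim.x is the total number of blocks in this launch's grid.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("this grid has %d blocks of %d threads\n", gridDim.x, blockDim.x);
}

int main()
{
    showGrid<<<10, 128>>>();   // one kernel call -> one grid of 10 blocks
    cudaDeviceSynchronize();
    return 0;
}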

Not sure I understand this question. A device and a GPU would normally be synonymous (i.e. they mean the same thing). Some "GPUs" like the K80 have 2 devices in a single board/product. In the case of the K80, each device has its own separate global memory.
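For example (a small sketch of my own using the standard runtime calls), you can enumerate the devices the runtime sees and query each one's global memory separately:

#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);          // a K80 board reports 2 devices here
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Each device has its own separate global memory.
        printf("device %d: %s, %zu bytes of global memory\n",
               d, prop.name, prop.totalGlobalMem);
    }
    return 0;
}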

Note that questions like these are mostly answered in the programming guide, and furthermore have been answered many times over on various web forums like this one. A bit of googling will likely provide answers to these types of questions. For instance, here is a duplicate of your first question, found with a small bit of googling:

https://devtalk.nvidia.com/default/topic/462166/cuda-programming-and-performance/can-threads-in-a-warp-from-different-blocks-/


Thank you for your answer, txbob!

So the number of threads executed per cycle depends on the instruction type.
How can I see the number of cores used?
Does it make sense to make my block size, whenever possible, a multiple of 32?