thread, warp, block, grid, device

Hello everyone,
I have read a lot about this, but it's not fully clear to me.

I have a Jetson TK1 with 1 Streaming Multiprocessor (SM) of 192 CUDA cores, also called Stream Processors (SPs).

What I have read:
Threads in a block are grouped into warps of 32 threads, and warps are executed in parallel.
Warps from different blocks can be executed on one SM.

Can threads from different blocks be in the same warp?

How many threads are executed on one SP? Intuitively I would say 1.
If so, then 192/32 = 6 warps could be executed in parallel on the TK1 at most.

I know that threads are grouped into blocks and blocks into grids.

Is a grid all blocks of one kernel call?

And is a device a single GPU? Does each device have its own global memory?

Thanks in advance!

No.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
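To see the consequence of that in code, here is a minimal sketch of my own (the kernel and the warpId/laneId names are made up for illustration): warps are carved out of the threads of a single block, 32 consecutive threads at a time, so a warp never spans two blocks.

#include <cstdio>

__global__ void showWarpLayout()
{
    // Warps are formed from consecutive threads of the SAME block:
    // threads 0-31 of a block are warp 0, threads 32-63 are warp 1, etc.
    int warpId = threadIdx.x / 32;   // warp index within this block
    int laneId = threadIdx.x % 32;   // position within the warp
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main()
{
    showWarpLayout<<<2, 64>>>();   // 2 blocks of 64 threads = 2 warps per block
    cudaDeviceSynchronize();
    return 0;
}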

An SP is a functional unit that can receive (i.e. begin processing) one floating-point instruction (FP add, FP multiply, or FP multiply-add) on each clock cycle. Scheduling such an instruction across a warp requires 32 SP "cores". Other types of instructions get issued to other types of functional units on the device.
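As a concrete example (a hypothetical kernel of my own, not from the programming guide): the a*x + y statement below typically compiles to a single FP32 multiply-add (FFMA) instruction, which is exactly the kind of instruction those SP units process. You can inspect the generated machine code with cuobjdump --dump-sass on the compiled binary.

#include <cstdio>

// Hypothetical kernel for illustration: a*x[i] + y[i] is a
// single-precision multiply-add; issued warp-wide, one such
// instruction occupies 32 SP "cores".
__global__ void saxpy(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];   // FP32 multiply-add
}

int main()
{
    const int n = 1024;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);   // expect 4.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}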

This is likely going down an incorrect thought path. A Kepler SM (like the one in the TK1) can indeed issue a maximum of 6 warps' worth of FP add, multiply, or multiply-add instructions (i.e. 6 instructions) in a single cycle (in practice we rarely observe this). However, other instructions get executed on other types of functional units in the SM, so this does not provide the full picture. Furthermore, it would not be correct to say that we want to write programs that consist of 6 warps in order to maximally utilize the GPU.
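To put a number on that: the relevant measure for keeping the SM busy is resident warps, not issued warps. Here is a small sketch of my own using the occupancy API from the CUDA runtime (the dummy kernel and block size are placeholders I chose):

#include <cstdio>

__global__ void dummy() { }   // placeholder kernel for the query

int main()
{
    int numBlocks = 0, blockSize = 256;
    // Ask the runtime how many blocks of 'dummy' at this block size
    // can be resident on one SM at the same time.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy, blockSize, 0);
    // A Kepler SM can hold up to 64 resident warps (2048 threads), far
    // more than the 6 warps' worth of FP32 instructions it can issue
    // per cycle; the extra resident warps are what hide latency.
    printf("blocks/SM: %d -> warps/SM: %d\n", numBlocks, numBlocks * blockSize / 32);
    return 0;
}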

My use of FP above refers to single-precision (i.e. FP32) arithmetic. Double-precision instructions use different functional units and have different throughput ratios.

To get an idea of what other arithmetic functional units exist on a typical SM, and their typical throughput ratios (which roughly indicates how many of each type of functional unit there are in an SM), refer to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Yes.
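To make that concrete, here is a small sketch of my own: the grid is exactly the set of blocks specified in the launch configuration of one kernel call, visible inside the kernel as gridDim.

#include <cstdio>

__global__ void showGrid()
{
    // gridDim.x is the total number of blocks in this launch's grid.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("this grid has %d blocks of %d threads\n", gridDim.x, blockDim.x);
}

int main()
{
    showGrid<<<10, 128>>>();   // one kernel call -> one grid of 10 blocks
    cudaDeviceSynchronize();
    return 0;
}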

Not sure I understand this question. A device and a GPU would normally be synonymous (i.e. they mean the same thing). Some "GPUs" like the K80 have 2 devices in a single board/product. In the case of the K80, each device has its own separate global memory.
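For example (a small sketch of my own using the standard runtime calls), you can enumerate the devices the runtime sees and query each one's global memory separately:

#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);          // a K80 board reports 2 devices here
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Each device has its own separate global memory.
        printf("device %d: %s, %zu bytes of global memory\n",
               d, prop.name, prop.totalGlobalMem);
    }
    return 0;
}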

Note that questions like these are mostly answered in the programming guide, and furthermore have been answered many times over on various web forums like this one. A bit of googling will likely provide answers to these types of questions. For instance, here is a duplicate of your first question, found with a small bit of googling:

https://devtalk.nvidia.com/default/topic/462166/cuda-programming-and-performance/can-threads-in-a-warp-from-different-blocks-/


Thank you for your answer, txbob!

So the number of threads executed per cycle depends on the instruction type.
How can I see the number of cores used?
Does it make sense to make my block size, whenever possible, a multiple of 32?