An SP is a functional unit that can receive (i.e. begin processing) one floating-point instruction (FP Add, FP Multiply, or FP Multiply-Add) on each clock cycle. Scheduling such an instruction across a warp therefore requires 32 SP “cores”, one per thread in the warp. Other types of instructions get issued to other types of functional units on the device.
This is likely heading down an incorrect line of thinking. A Kepler SM (like the one in TK1) can indeed issue a maximum of 6 warps (i.e. 6 warp-wide instructions) that are FP Add, FP Multiply, or FP Multiply-Add in a single cycle, although in practice we rarely observe this. However, other instructions get executed on other types of functional units in the SM, so this does not provide the full picture. Furthermore, it would not be correct to say that we want to write programs that consist of 6 warps in order to maximally utilize the GPU.
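The figure of 6 follows from simple arithmetic: a Kepler SM has 192 SP cores, and one warp-wide FP32 instruction occupies 32 of them. A minimal sketch of that calculation is below; the function and constant names are hypothetical, chosen only for illustration, and the result is a peak issue rate for FP32 Add/Multiply/Multiply-Add only, not a recommendation for how many warps to launch.

```cpp
// Threads per warp on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Hypothetical helper (not a CUDA API): the maximum number of warp-wide
// FP32 instructions an SM can begin in one clock, assuming each SP core
// accepts one FP32 instruction per clock and a warp needs kWarpSize cores.
constexpr int fp32_warp_instructions_per_clock(int sp_cores_per_sm) {
    return sp_cores_per_sm / kWarpSize;
}

// A Kepler SM has 192 SP cores, which gives the 6 warps per clock quoted above.
constexpr int kKeplerSpCoresPerSm = 192;
static_assert(fp32_warp_instructions_per_clock(kKeplerSpCoresPerSm) == 6,
              "Kepler: 192 SP cores / 32 threads per warp = 6");
```

On a real system the SM count and warp size can be read from `cudaGetDeviceProperties`, but the number of SP cores per SM is not reported there and has to be looked up for the architecture in question.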
My use of FP above refers to single-precision (i.e. FP32) arithmetic. Double-precision operations use different functional units and have different throughput ratios.
To get an idea of what other arithmetic functional units exist on a typical SM, and their typical throughput ratios (which roughly indicate how many of each type of functional unit there are in an SM), refer to the programming guide:
Not sure I understand this question. A device and a GPU would normally be synonymous (i.e. they mean the same thing). Some “GPUs” like K80 have 2 devices on a single board/product. In the case of K80, each device has its own separate global memory.
Note that questions like these are mostly answered in the programming guide, and furthermore have been answered many times over on various web forums like this one. A bit of googling will likely provide answers for these types of questions. For instance, here is a duplicate of your first question that I found with a small bit of googling effort: