Yes, the warp is the smallest unit of execution on the device. But the answer to your last question depends on what you mean by “in parallel”. If you mean exactly simultaneously, then 16*32 threads are not executed in parallel: each warp is issued as a burst over 4 clock cycles on the multiprocessor's 8 ALUs.
Memory/computation interleaving is where you can truly harness the power of the GPU, and that requires more than one warp resident on the same multiprocessor: while one warp is waiting on memory, the others can be executing arithmetic instructions. Each multiprocessor can host up to 768 threads, so the number of concurrent threads is 768*16 = 12288 (assuming 100% occupancy). This is a more sensible definition of “in parallel” on the GPU, because it counts the threads in flight that are actively making progress.