You’ll never fill all 768 threads with half-warps: the device executes whole warps as its smallest unit of execution. Half-warps only come into play with shared memory bank accesses.
To answer your other question, the manual says:
“The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2.1 describes how thread IDs relate to thread indices in the block.” (section 3.2)
“Each thread is identified by its thread ID, which is the thread number within the block. To help with complex addressing based on the thread ID, an application can also specify a block as a two- or three-dimensional array of arbitrary size and identify each thread using a 2- or 3-component index instead. For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx) and for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).” (section 2.2.1)
So, threads with consecutive thread IDs should be grouped into the same warp.
However, while “the manual says so” is usually the final word, some tests seem to suggest that this grouping of consecutive threads into warps does not hold: http://forums.nvidia.com/index.php?showtopic=57779
Specifically, look at the timings for the 64x4 and 4x64 block sizes. If consecutive threads were indeed grouped into warps, those two sizes should have the same timings. The results there seem to indicate that threads are not merged into warps across multiple rows of the block.
That said, because the test uses a texture read, the 4x64 vs 64x4 performance difference may come from the 2D texture cache rather than warp layout… though I don’t find that likely.