./deviceQuery tells me that the TX2 supports a maximum of 2048 threads per multiprocessor and that the warp size is 32. This means that a multiprocessor can hold 2048 / 32 = 64 warps at a time.
If I am not mistaken, a warp can only be scheduled on a single core, so 64 warps would map to 64 cores. But the TX2 has 128 CUDA cores per multiprocessor. If that were the case, only half of the CUDA cores would ever be used, and I am fairly sure that conclusion is wrong. What am I missing here?
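
To make the arithmetic behind my reasoning explicit, here is a small Python sketch. The three constants are the values quoted above; the function names are made up for illustration, and the one-warp-per-core mapping is purely my assumption (and probably the part that is wrong).

```python
# Values for the TX2 as quoted in the question above.
MAX_THREADS_PER_SM = 2048
WARP_SIZE = 32
CUDA_CORES_PER_SM = 128


def resident_warps(max_threads_per_sm: int, warp_size: int) -> int:
    """Maximum number of warps that can be resident on one multiprocessor."""
    return max_threads_per_sm // warp_size


def cores_used_if_one_warp_per_core(warps: int, cores_per_sm: int) -> int:
    """Cores occupied under the ASSUMPTION that each warp runs on exactly one core.

    This mapping is the questioner's assumption, not a documented hardware rule.
    """
    return min(warps, cores_per_sm)


warps = resident_warps(MAX_THREADS_PER_SM, WARP_SIZE)
used = cores_used_if_one_warp_per_core(warps, CUDA_CORES_PER_SM)

print(f"resident warps per multiprocessor: {warps}")
print(f"cores used under my assumption: {used} of {CUDA_CORES_PER_SM}")
```

Under that assumption the sketch reports 64 of 128 cores in use, which is exactly the conclusion I doubt.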