I thought we used vector lengths that are multiples of 32 because a warp has 32 CUDA cores.
A warp has 32 threads, not cores. Threads, warps, and blocks are software (programming) elements; compute units, cores, and multiprocessors are hardware elements. Multiple warps may be actively running at the same time on different compute units.
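This distinction still matters when choosing a block size: the hardware schedules threads in whole warps of 32, so a block size that is not a multiple of 32 leaves lanes idle in its last warp. A minimal host-side sketch of that arithmetic (the helper names `warps_per_block` and `idle_lanes` are illustrative, not part of any CUDA API):

```c
#include <stddef.h>

#define WARP_SIZE 32  /* fixed at 32 on all current NVIDIA GPUs */

/* Warps the scheduler must allocate for a block of the given size:
   warps are allocated whole, so we round up. */
static int warps_per_block(int threads_per_block) {
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Threads in the last warp that do no useful work. */
static int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}
```

For example, a block of 100 threads still costs 4 full warps, with 28 lanes of the last warp idle, while a block of 128 threads fills 4 warps exactly.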
Do I still have to use a vector length that is a multiple of 32 when using double precision?
The floating-point precision you're using has no effect on the vector length. While using double precision may require threads to share resources and thus affect performance, it does not change how you program.
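Concretely, the launch geometry depends only on the element count, never on the element type; only the buffer sizes change with precision. A hedged host-side sketch (the helper names and the guarded-kernel convention it assumes, `if (i < n)` inside the kernel, are illustrative):

```c
#include <stddef.h>

/* Blocks needed to cover n elements at the given block size.
   Identical whether the kernel operates on float or double:
   precision never enters the launch geometry. A bounds check
   in the kernel lets n be any size, not a multiple of 32. */
static size_t blocks_for(size_t n, size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Only the byte counts differ with precision. */
static size_t bytes_for_float(size_t n)  { return n * sizeof(float);  }
static size_t bytes_for_double(size_t n) { return n * sizeof(double); }
```

For instance, 1000 elements at a common (but arbitrary) block size of 256 need 4 blocks in either precision; the double-precision allocation is simply twice as many bytes.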
If CUDA cores are for single precision, what do they do during double-precision operations?
That depends. If another active warp is performing single-precision or integer instructions, the other compute units will be occupied with those instructions. If all the active warps are performing only double-precision work, then those units sit idle.
See the section labeled “Streaming Multiprocessor (SMX) Architecture” for more details.