According to most NVIDIA documentation, CUDA cores are scalar processors that should only execute scalar operations, which get vectorized into 32-thread SIMT warps.
But CUDA (and OpenCL) also have vector types such as uchar4. A uchar4 has the same size as a uint (32 bits), which a single scalar core can process. If I do operations on a uchar4 (for example, component-wise addition), will this also map to a single instruction on a single core?
If there are 1024 work items in a block (work group) and each work item processes a uchar4, will this effectively process 4096 uchar values in parallel?