Performance breakdown with vector types: uint8 worse than uint4

Hi all,

I was wondering why ATI and Nvidia GPUs show different performance patterns. On ATI GPUs, performance increases as I widen the vector type (in my case from uint4 to uint8 to uint16), but on Nvidia hardware uint8 performs worse than uint4.

For now I’m not interested in comparing the two architectures; I’m just trying to figure out why uint8 shows a performance breakdown relative to uint4. Any ideas?
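For reference, here is a minimal sketch of the kind of copy kernel I mean; the identifiers are illustrative, not my actual code. Swapping uint8 for uint4 or uint16 (with the buffers retyped to match) produces the variants being compared:

```
// Illustrative OpenCL kernel: each work-item copies one uint8 element.
// The only change between variants is the vector width of the buffers.
__kernel void copy_uint8(__global const uint8 *src,
                         __global uint8 *dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}
```

Per work-item, the uint8 variant moves 32 bytes per load/store, versus 16 bytes for uint4, so the per-transaction width is the main variable being exercised.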