Why is writing an int faster than writing a uchar4? (gets 64b writes instead of 32b)

Any ideas why the following code (out is of type uchar4 *)

uchar4 oval = make_uchar4(rintf(val.x), rintf(val.y), rintf(val.z), rintf(val.w));
*(int *)(out + __umul24(y, outStride/sizeof(uchar4)) + x) = *(int *)&oval;

gets 64b writes and is thus faster than the following code

uchar4 oval = make_uchar4(rintf(val.x), rintf(val.y), rintf(val.z), rintf(val.w));
*(out + __umul24(y, outStride/sizeof(uchar4)) + x) = oval;

which seems to write 8 bits per thread instead of 32 bits (32b warp writes) and is thus slower?
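
For anyone who wants to reproduce this, here is a minimal sketch that puts both store forms into one templated kernel so the generated code can be compared side by side. Only the two store statements come from my code above; the kernel name packPixels, the helper runBothVariants, the float4 input, the image size, and the launch configuration are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical repro kernel. outStride is assumed to be the row pitch in bytes.
// kStoreAsInt selects between the two store forms from the post.
template <bool kStoreAsInt>
__global__ void packPixels(uchar4 *out, const float4 *in,
                           unsigned int width, unsigned int height,
                           unsigned int outStride)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float4 val = in[y * width + x];
    uchar4 oval = make_uchar4(rintf(val.x), rintf(val.y), rintf(val.z), rintf(val.w));

    if (kStoreAsInt) {
        // Variant A: store the packed pixel through an int pointer.
        *(int *)(out + __umul24(y, outStride / sizeof(uchar4)) + x) = *(int *)&oval;
    } else {
        // Variant B: store the uchar4 directly.
        *(out + __umul24(y, outStride / sizeof(uchar4)) + x) = oval;
    }
}

// Runs both variants on the same input and reports whether the results agree.
bool runBothVariants()
{
    const unsigned int width = 256, height = 64;
    const unsigned int outStride = width * sizeof(uchar4);  // bytes per row, no padding
    const size_t n = (size_t)width * height;

    // Input values stay within 0..253 so they fit in an unsigned char.
    std::vector<float4> hostIn(n);
    for (size_t i = 0; i < n; ++i) {
        float v = (float)(i % 251);
        hostIn[i] = make_float4(v, v + 1.0f, v + 2.0f, v + 3.0f);
    }

    float4 *dIn = 0;
    uchar4 *dOutA = 0, *dOutB = 0;
    cudaMalloc(&dIn, n * sizeof(float4));
    cudaMalloc(&dOutA, n * sizeof(uchar4));
    cudaMalloc(&dOutB, n * sizeof(uchar4));
    cudaMemcpy(dIn, hostIn.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    packPixels<true><<<grid, block>>>(dOutA, dIn, width, height, outStride);
    packPixels<false><<<grid, block>>>(dOutB, dIn, width, height, outStride);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    std::vector<uchar4> a(n), b(n);
    cudaMemcpy(a.data(), dOutA, n * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dOutB, n * sizeof(uchar4), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOutA);
    cudaFree(dOutB);
    return memcmp(a.data(), b.data(), n * sizeof(uchar4)) == 0;
}

int main()
{
    bool same = runBothVariants();
    printf("outputs %s\n", same ? "match" : "differ");
    return same ? 0 : 1;
}
```

Both variants should write the same bytes, so any difference would have to show up in the store instructions themselves. Building with nvcc -ptx and comparing the stores emitted for packPixels<true> and packPixels<false> is one way to check whether the compiler really splits the uchar4 store.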

The way I use the stride also seems to change the behavior.

This form, which is potentially wrong (if the stride isn't a whole multiple of sizeof(uchar4)),
out + __umul24(y, outStride/sizeof(uchar4))
can sometimes get good write behavior, while
(uchar4 *)((char *)out + __umul24(y, outStride))
usually makes the compiler generate bad (i.e. shorter) write patterns.
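
To show what I mean by "potentially wrong", here is a small sketch of the row byte offsets the two forms reach. The helper names are hypothetical, outStride is assumed to be the row pitch in bytes, and plain multiplication stands in for __umul24 so the helpers can also run on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Byte offset reached by: out + y * (outStride / sizeof(uchar4))
// The integer division drops any remainder of the stride.
__host__ __device__ inline unsigned int rowOffsetByElements(unsigned int y,
                                                            unsigned int outStride)
{
    unsigned int strideInElements = (unsigned int)(outStride / sizeof(uchar4));
    return y * strideInElements * (unsigned int)sizeof(uchar4);
}

// Byte offset reached by: (uchar4 *)((char *)out + y * outStride)
__host__ __device__ inline unsigned int rowOffsetByBytes(unsigned int y,
                                                         unsigned int outStride)
{
    return y * outStride;
}

int main()
{
    const unsigned int strides[2] = {1024u, 1030u};  // 1030 is not a multiple of 4
    const unsigned int y = 3;
    for (int i = 0; i < 2; ++i) {
        unsigned int byElems = rowOffsetByElements(y, strides[i]);
        unsigned int byBytes = rowOffsetByBytes(y, strides[i]);
        printf("stride %u, row %u: element form -> byte %u, byte form -> byte %u "
               "(byte form 4-byte aligned: %s)\n",
               strides[i], y, byElems, byBytes, (byBytes % 4 == 0) ? "yes" : "no");
    }
    return 0;
}
```

With a stride of 1024 both forms land on byte 3072 for row 3. With a stride of 1030 the element form lands on byte 3084 while the byte form lands on byte 3090, which is no longer a multiple of 4. So the byte form follows the real pitch, but it can only produce properly aligned uchar4 rows when the pitch itself is a multiple of sizeof(uchar4), which I assume holds for pitches returned by cudaMallocPitch.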

thanks