Any ideas why the following code (out is of type uchar4 *)
uchar4 oval = make_uchar4(rintf(val.x), rintf(val.y), rintf(val.z), rintf(val.w));
*(int *)(out + __umul24(y, outStride/sizeof(uchar4)) + x) = *(int *)&oval;
gets 64b writes, and is thus faster than the following code
uchar4 oval = make_uchar4(rintf(val.x), rintf(val.y), rintf(val.z), rintf(val.w));
*(out + __umul24(y, outStride/sizeof(uchar4)) + x) = oval;
which seems to write 8 bits instead of 32 bits (32b warp writes), and is thus slower?
Also, the way I apply the stride seems to change the behavior.
This form, which is potentially wrong (if the stride isn't a whole multiple of sizeof(uchar4)),
out + __umul24(y, outStride/sizeof(uchar4))
can sometimes get good write behavior, while
(uchar4 *)((char *)out + __umul24(y, outStride))
usually makes the compiler generate worse (i.e. shorter) write patterns.
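To make the stride concern concrete, here is a minimal host-side sketch in plain C++ (not the kernel itself). Pixel4 is a hypothetical stand-in for uchar4, and the function names are made up for illustration. It only shows that the two row-offset forms agree when the stride is a whole multiple of 4 bytes and disagree otherwise. It says nothing about which store instructions the compiler emits.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's uchar4 so this builds as plain host C++.
struct Pixel4 { unsigned char x, y, z, w; };
static_assert(sizeof(Pixel4) == 4, "Pixel4 is expected to be 4 bytes");

// Form 1: convert the byte stride to elements first, as in
//   out + y * (outStride / sizeof(uchar4))
// Integer division truncates, so the result is only exact when
// stride_bytes is a whole multiple of sizeof(Pixel4).
inline std::size_t row_offset_divided(std::size_t y, std::size_t stride_bytes) {
    return y * (stride_bytes / sizeof(Pixel4)) * sizeof(Pixel4);
}

// Form 2: stay in bytes, as in
//   (uchar4 *)((char *)out + y * outStride)
// This is always the exact byte offset of row y.
inline std::size_t row_offset_bytes(std::size_t y, std::size_t stride_bytes) {
    return y * stride_bytes;
}

// True when every row start is a whole number of Pixel4 elements from the
// base pointer, i.e. when both forms above give the same answer.
inline bool stride_is_whole_multiple(std::size_t stride_bytes) {
    return stride_bytes % sizeof(Pixel4) == 0;
}
```

For example, with a stride of 514 bytes, row 3 really starts at byte 1542, but the divided form lands on byte 1536. With a stride of 512 bytes both forms give 1536.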
thanks