Do we need memory coalescing on CPUs, as we do on GPUs?
GPUs require memory coalescing to aggregate the memory accesses from multiple threads together into a single transaction.
CPUs don’t execute threads in lock-step, so the same concept of coalescing between threads isn’t necessary. That being said, CPUs have
complex memory hierarchies that require programs to obey restrictions to deal with alignment, cache line sizes, DRAM buffer sizes,
prefetchers, false sharing, write-combining buffers, and a variety of other subtleties. Not considering these restrictions will
lead to reduced memory performance, similar to how not coalescing memory accesses will reduce memory bandwidth on a GPU.
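
To make the cache-line point concrete, here is a minimal sketch (not a benchmark) of the same sum computed with two access patterns. The function names `sum_row_major` and `sum_column_major` are made up for illustration, and the comments assume a typical 64-byte cache line, which may not match your CPU. Both functions return the same result; only the order in which memory is touched differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a rows x cols matrix stored in row-major order,
// walking memory contiguously. Neighbouring iterations touch neighbouring
// addresses, so every element of a fetched cache line gets used
// (16 ints per line if the line is 64 bytes, which is an assumption).
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Hypothetical helper: same sum, but the inner loop jumps `cols` elements
// at a time. For large matrices each access tends to land on a different
// cache line, so most of every fetched line goes unused.
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

On a matrix much larger than the cache, the contiguous version is usually the faster one, but how much faster depends on the hardware, so it is worth timing on your own machine.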
Thanks Diamos. How about using int4 and float4 on the CPU? Can we benefit from them,
and can they improve cache locality?
The closest equivalent is probably the 128-bit SSE/SSE2 registers. There are load/store instructions that move 4 floats or 4 ints at once, which is likely faster than doing four separate loads.
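
As a rough sketch of what that looks like in code, assuming an x86 CPU with SSE2 and a compiler that ships `<emmintrin.h>`: the helper names `add_float4` and `add_int4` are hypothetical, and the unaligned load/store variants are used so the pointers need no special alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics; also pulls in the SSE float intrinsics

// Hypothetical helper: out[i] = a[i] + b[i], four floats per iteration.
// Assumes n is a multiple of 4; a real version would handle the tail.
void add_float4(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // one load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // one store of 4 floats
    }
}

// Hypothetical helper: the same idea for four 32-bit ints per iteration.
void add_int4(const int* a, const int* b, int* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
    }
}
```

Note that these wide loads reduce the number of instructions issued; they do not change which cache lines the loop touches, so locality still comes from how the data is laid out and traversed.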