Why does the GPU have larger memory bandwidth than the CPU?

1. Why does the GPU have larger memory bandwidth than the CPU?
2. Is there an optimization technique to explicitly exploit the large memory bandwidth? Coalesced memory access?

High-end NVIDIA GPUs have much, much wider buses and higher memory clock rates than any CPU. The Intel processor with the highest memory bandwidth is the Core i7, which has a 192-bit-wide memory bus with an effective memory clock of up to 800 MHz. The fastest NVIDIA GPU is the GTX 285, which has a 512-bit-wide memory bus and a 1242 MHz memory clock.

(For the nit pickers in the audience, the DDR3 memory clock I’m using above is the I/O bus clock, which is comparable to the memory clock on NVIDIA GPUs. In both cases, you get a transfer on the rising and falling edge of the clock.)

Yes, coalesced memory access, as described in the programming guide, is the key to maximizing memory bandwidth. Such a wide bus is best utilized when transferring large, contiguous blocks of data. "Coalescing" is NVIDIA's term for achieving this by having each thread in a warp access a neighboring memory location.
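As an illustration (my own sketch, not from the programming guide), here are two copy kernels: in the first, consecutive threads of a warp touch consecutive addresses, so the hardware can merge each warp's loads into a few wide transactions; in the second, each thread's address is `stride` elements away from its neighbor's, so the same data costs many more transactions and most of each wide bus transfer is wasted.

```cuda
// Coalesced: thread k of the grid reads and writes element k, so a
// warp's 32 accesses fall in one contiguous block of memory.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads access addresses `stride` elements
// apart, defeating coalescing for stride > 1.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

On hardware of the GTX 285 generation, the strided version can run many times slower than the coalesced one even though both kernels move the same number of useful bytes.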

Thanks!

One more question: why does the GPU adopt such a design, and why doesn't the CPU adopt it too (a wider bus and a higher clock rate)?

GPUs were originally designed for 3D rendering, which requires processing a large dataset of polygons and textures. The amount of data that a GPU has to repeatedly process is much larger than a L2/L3 cache could hold, so the only way to improve rendering performance was to make the memory bus wider and faster. This increases the price of GPUs and also requires the use of more expensive memory chips. (The GTX 285 costs $350, but only has 1 GB of memory. A Core i7 CPU with 6 GB of memory can be purchased for roughly the same price.)

CPUs run a wide range of programs, many of which do not have a large “working set” of data. A substantial fraction of the data used by a typical CPU program fits into the L2 or L3 cache, and those on-chip caches are much faster than the off-chip memory bus of a GPU. Moreover, CPU programs tend to have much more random memory access patterns, which would not derive much benefit from a wide memory bus.

GPUs and CPUs have different designs mostly because they are trying to optimize cost and performance for different problem domains. Data-parallel algorithms favor the GPU design tradeoffs, whereas single-threaded programs and task parallelism (like pthreads) favor the CPU design tradeoffs.