Write-Combined Memory: How does it enhance performance?


The CUDA 2.2 Pinned Memory APIs documentation (simpleZeroCopy SDK example) says:
“Writes to WC memory are not cached in the typical sense of the word cached. They are delayed in an internal buffer that is separate from the internal L1 and L2 caches. The buffer is not snooped and thus does not provide data coherency.”
If writes to WC memory are not cached, where does the performance gain come from?


The performance comes from queueing up small writes and combining them into larger bursts, which maximizes the throughput of the eventual write. This should sound similar to what’s described and encouraged in the CUDA docs.
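
To make the combining idea concrete, here is a rough toy model in C++, not real hardware behavior. It counts how many bus transactions a stream of small stores would need if every store to the same aligned line were merged into one burst. The 64-byte line size, the single combining buffer, and the names `kLineBytes`, `make_store_addresses` and `count_combined_transactions` are all assumptions made up for illustration. Real CPUs have several WC buffers and flush them on other events as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed size of one write-combining buffer (hypothetical, for illustration).
constexpr std::uint64_t kLineBytes = 64;

// Build the byte addresses of `count` stores spaced `stride_bytes` apart,
// starting at address 0.
std::vector<std::uint64_t> make_store_addresses(std::size_t count,
                                                std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs;
    addrs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        addrs.push_back(static_cast<std::uint64_t>(i) * stride_bytes);
    }
    return addrs;
}

// Toy model with a single combining buffer: consecutive stores that fall in
// the same aligned line are merged. The buffer is flushed (one transaction)
// when a store touches a different line, and once more at the end.
std::size_t count_combined_transactions(const std::vector<std::uint64_t>& addrs) {
    std::size_t transactions = 0;
    bool buffer_open = false;
    std::uint64_t current_line = 0;
    for (std::uint64_t addr : addrs) {
        const std::uint64_t line = addr / kLineBytes;
        if (!buffer_open || line != current_line) {
            if (buffer_open) {
                ++transactions;  // flush the previously filled line
            }
            current_line = line;
            buffer_open = true;
        }
    }
    if (buffer_open) {
        ++transactions;  // final flush of the last open line
    }
    return transactions;
}
```

In this model, 1024 sequential 4-byte stores (`make_store_addresses(1024, 4)`) collapse into 64 transactions, while 1024 stores with a 64-byte stride (`make_store_addresses(1024, 64)`) still cost 1024, because nothing can be merged. That is why WC memory rewards sequential, write-only access patterns.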

This Intel doc from 11/1998 is a good place to start: Write Combining Memory Implementation Guidelines.

Actually, re-reading that document carefully reveals that writes are cached in a way (they are held in the combining buffer), while reads aren’t cached at all. Because the WC region is not snooped, the CPU doesn’t have to monitor the PCI-E bus in order to keep its cache up to date, and therefore transfers over the bus can run faster. This makes sense if you use a buffer exclusively for host<->device transfer, e.g. as an intermediate for exchange between two GPUs. There might be other similar scenarios.