I have a small piece of code that uses a zero-copy buffer (4 bytes) for an integer. In the kernel, each thread executes only one statement: atomicAdd(&count, 1).
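Roughly, the setup looks like this (a minimal sketch, not my exact code; I'm assuming the usual cudaHostAllocMapped path for zero-copy, and the launch configuration shown is just illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread executes a single atomic increment on the shared counter.
__global__ void countKernel(unsigned int *count) {
    atomicAdd(count, 1u);
}

int main() {
    // Allow mapping of pinned host memory into the device address space
    // (required on pre-UVA platforms; harmless otherwise).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Zero-copy: a 4-byte pinned host allocation mapped for device access.
    unsigned int *h_count = nullptr;
    cudaHostAlloc(&h_count, sizeof(unsigned int), cudaHostAllocMapped);
    *h_count = 0;

    // Device-side alias of the same host memory.
    unsigned int *d_count = nullptr;
    cudaHostGetDevicePointer(&d_count, h_count, 0);

    // Every atomicAdd here targets host memory.
    countKernel<<<1024, 256>>>(d_count);
    cudaDeviceSynchronize();

    printf("count = %u\n", *h_count);
    cudaFreeHost(h_count);
    return 0;
}
```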
Profiling shows a large number of stalls caused by memory throttling. Compared to a version that uses a normal device-memory buffer, the performance is much worse.
Can anyone help me understand this? Does it mean that every write from the kernel involves a PCIe data transfer?
Yes. Unless you are on an integrated platform where system and GPU memory are the same physical memory, such as Tegra, zero-copy means the GPU accesses the host's system memory directly across PCIe. This is a high-latency, low-bandwidth access path. In your case the high-latency aspect is what hurts performance, as it causes stalls: even with many threads running on the GPU, there may not be enough parallelism to cover the entire access latency.
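If all you need on the host is the final count, the usual pattern is to accumulate in device memory and copy the 4-byte result back once after the kernel finishes. A minimal sketch of that alternative (same illustrative launch configuration as above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel, but the counter lives in device memory, so the atomics
// resolve on the GPU instead of crossing PCIe.
__global__ void countKernel(unsigned int *count) {
    atomicAdd(count, 1u);
}

int main() {
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    countKernel<<<1024, 256>>>(d_count);

    // A single 4-byte transfer across PCIe, instead of one transaction
    // per atomicAdd.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    printf("count = %u\n", h_count);
    cudaFree(d_count);
    return 0;
}
```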