Fast vs. Slow memcpy Trying to understand GPU I/O via memcpy

I am trying to understand why inserting a loop into a memcpy kernel can drastically reduce I/O performance.

I recently read a great article about “Better Performance at Lower Occupancy” by Vasily Volkov @ UC Berkeley. In trying to understand Fermi architecture I/O, I went ahead and wrote my own memcpy functions. My first memcpy based on Vasily’s code ran at about 162.5 GB/s on my GTX 480 card which is about 89% of peak (162.5/182.4). Note: This is for a very large data set just shy of 100 million 32-bit elements. The CTA was a grid<960,400,1> and block<64,1,1>. I then tried to eliminate the 2nd dimension (IE columns) from the grid by moving them inside the kernel as a loop over fixed size chunks of work. This dropped performance down to 138 GB/s. Increasing the # of elements per thread and inserting __syncthreads got performance back up to 148 GB/s. I then rewrote the code to eliminate the need for __syncthreads while going back down to 4 elements per thread, but still only achieved 148 GB/s, which is 81% of peak (148/182.4).

Looking at the generated .PTX code, the loop code looks pretty efficient (1 label, 3 adds, 1 compare, 1 predicated branch) so I am unsure what is killing performance in the looping version of the code. Based on the same article by Vasily, I tried copying data as "uint4"s, I eventually achieved 157 GB/s but had to create an awkward CTA to achieve this (Grid = 120x1, Block = 512x1, Each block processes 50 fixed size chunks, each chunk contains 16384 elements, or in other words each thread processes 8 uint4s (32 elements) per chunk of work.

Does anyone have any insights if “looping” actually kills performance that bad or is there some other architectural issue going on? For instance I noticed in the Nvidia Compute Visual Profiler that the fast memcpy has about 9% Global replays vs. 22% global replays for the slower looping memcpy, subtracting the two gets about an 11% performance differential which is about what I’m seeing (89/81 = 1.098)

Memory constraints on the GTX 480 include an L1 cache (128 bytes at at time, 16K per SM), L2 cache (8 bytes at a time, 768K capacity) with 6 memory controllers, 1.5 GB Global memory. Meanwhile, there are 15 SM’s with 2 warp schedulers each working on up to 16 warps each to move data around.

Thanks in advance for your feedback.


P.S. I’ve included a stand-a-lone test file that includes the 3 different kernels (fast, loop, and uint4) and a CPU host wrapper for testing the I/O throughput of each kernel. (34.3 KB)

This is an interesting question. I’ve been trying to dig into this, but the behavior of the GPU seems to be complex and hard to explain.

One thing I can say for now is that maybe memcpy is not the best way to approach this. Performance of memcpy is guided by two separate processes, reading and writing, whose performance depends on your code in different ways. To take the loop version specifically, I can approach the theoretical writing bandwidth relatively easily on a 560, using short loops and as little as 1 32-bit write per loop, reaching 122 Gb/s on a 560 (the theoretical max at my clocks is 131 Gb/s). On the other hand, reading works optimally with long loops and multiple reads per loop.

More generally, one has to wonder why we can’t get full bandwidth with just about any kernel configuration and access pattern, as long as reads from each warp are fully coalesced into 128-byte accesses and all SMs are fully loaded.

Or even without full loading. I suspect that it has to do with DDR3 seek latency. It takes only 1 ns to transfer 128 bytes from global memory to the GPU on a 560 (0.7 ns on a 480). But native latencies of random access to DDR3 memory measure in tens of nanoseconds. So even one seek operation per 10-20 consecutive reads will have noticeable impact on the performance. With 8 cores sending multiple read requests to L2 for locations all over the address space, minimizing seeks may be a challenge.

Thats great! Your write-only solution is running at about 93% of peak (122/131). You could create a very fast memset function using your approach.

How short is a short loop? If it is a fixed sequence of say 4 elements or less than the NVidia compiler is most likely unrolling the loop away. For instance, in the first fast kernel based on Vasily’s code. The short loop of 4 reads followed by another short loop of 4 writes all completely disappear when you take a look at the optimized .PTX code.

Isn’t the difference between reading and writing influenced by the L1/L2 cache? IE reading uses both the L1/L2 cache whereas writing uses just the L2 cache. The L1 cache according to the documentation works with 128 bytes at a time (32 32-bit integers) for a warp sized view of memory. However, The L2 cache according to the documentation works with 32 bytes at a time (8 32-bit integers) for a 1/4-warp sized view of memory. This is further complicated by having 6 memory controllers but 15 SM’s with up to 8 concurrent blocks all trying to read/write at the same time. I assume NVidia uses hardware queues on each memory controller to help manage this competition.

In the case of a long sequential operation like copying (read/write), memset (write-only), or reduction (read-only), the cache doesn’t get much reuse as each new coalesced I/O request is for a new piece of data. So in a sense, memcpy is a worse case scenario for the caching architecture. The L1/L2 caching is actually just extra layers of overhead between global memory, the memory controllers and the SMs.


Thanks for the new information.

So DD3 (or DD5) RAM works best with sequential access. But full sequential access is hard to achieve when you have 15 SMs competing in parallel for access to the memory. Each SM may have bursts of sequential I/O activity but collectively their I/O access is effectively randomly interleaved, resulting in seeks that kill performance.

If spotted two things so far:


There are still two [font=“Courier New”]__syncthreads()[/font] in the supposed [font=“Courier New”]__syncthreads()[/font]-free version, which probably reduce performance a bit.

The calculation of [font=“Courier New”]warpCol[/font] is wrong, comes out as [font=“Courier New”]tid[/font] instead of [font=“Courier New”]tid%32[/font]:

const U32 logWarpSize    = 5u;

   const U32 WarpSize       = 1u << logWarpSize;

   const U32 WarpMask       = WarpSize - 1u;
// Get memory location

   U32 warpRow = threadIdx.x >> logWarpSize; // tid / 32

   U32 warpCol = threadIdx.x & WarpMask;     // tid % 32

I’m not sure how this influences performance - the fact that different warps might access the same memory locations might slow down the kernel a bit, but more importantly the resulting out-of-bounds access should make the kernel appear to be very fast.

The …loop_nosync kernel has no __syncthreads() they are all commented out. You must be referring to the …_uint4() kernel which does have 2 syncthreads() which I inserted because

they actually improve performance. Though I don’t know why…

This is a bit-trick for powers of 2 for getting the modulus quickly without paying the full cost of a true modulus operation.

This is equivalent to

U32 warpCol = threadIdx.x & 31;     // tid % 32

which in bit-wise notation would be

U32 warpCol = threadIdx.x & b11111;

So, The code is effectively masking the bottom 5 bits off threadIdx.x to obtain values

in the range [0…31] which is equivalent to the more expensive ‘threadIdx.x % 32’



One other difference as is that each reading request HAS to go through to the global memory, but the writing request can stop at L2. Any well-designed L2 will try to coalesce accesses into longest blocks possible (because of the seek problem). But that’s much easier to do with writing (you just wait till you have enough sequential data to write) than with reading (for example, before L2 can initiate a global request for 512 bytes starting at X, it needs to have four queued read requests for X, X+128, X+256, and X+384, from four different warps somewhere in the GPU).

At this point I’m deep into speculation territory, I don’t have any hard info on how the L2 cache in Fermi operates. I can’t even find hard data on GDDR5 memory timings (but DRAM is DRAM, and, at those frequencies, we should be seeing worst-case-scenario seek times of 40 clocks or more). So take it for what it’s worth.

Can the performance be improved by spreading the accesses evenly among the channels (partitions) to maximize global memory throughput : Partition Camping