I am trying to understand why inserting a loop into a memcpy kernel can drastically reduce I/O performance.
I recently read a great article, “Better Performance at Lower Occupancy” by Vasily Volkov at UC Berkeley. To understand Fermi architecture I/O, I went ahead and wrote my own memcpy kernels. My first memcpy, based on Vasily’s code, ran at about 162.5 GB/s on my GTX 480, which is about 89% of peak (162.5/182.4). Note: this is for a very large data set, just shy of 100 million 32-bit elements. The launch configuration was grid<960,400,1> and block<64,1,1>. I then tried to eliminate the second grid dimension (i.e., the columns) by moving it inside the kernel as a loop over fixed-size chunks of work. This dropped performance to 138 GB/s. Increasing the number of elements per thread and inserting __syncthreads() brought performance back up to 148 GB/s. I then rewrote the code to eliminate the need for __syncthreads() while going back down to 4 elements per thread, but still only achieved 148 GB/s, which is 81% of peak (148/182.4).
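For readers without the attachment, here is a minimal sketch of the two launch schemes described above. It is not the code from the test file: the names (copy_chunk, copy_grid2d, copy_loop, copies_match) are hypothetical, and the way the loop version assigns chunks to blocks (chunk = c * gridDim.x + blockIdx.x) is only an assumption about how the second grid dimension might be folded into the kernel.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical constants and names; this is not the code from the attached test file.
constexpr int kThreads   = 64;                     // block<64,1,1>, as in the post
constexpr int kPerThread = 4;                      // elements copied per thread
constexpr int kChunk     = kThreads * kPerThread;  // elements per chunk of work

// Copy one chunk. Consecutive threads touch consecutive 32-bit words,
// so each warp-wide access is coalesced.
__device__ void copy_chunk(const unsigned* in, unsigned* out, size_t n, size_t chunk) {
    size_t base = chunk * kChunk + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < kPerThread; ++i) {
        size_t idx = base + (size_t)i * kThreads;
        if (idx < n) out[idx] = in[idx];
    }
}

// Variant A: 2D grid, one chunk per block.
__global__ void copy_grid2d(const unsigned* in, unsigned* out, size_t n) {
    copy_chunk(in, out, n, (size_t)blockIdx.y * gridDim.x + blockIdx.x);
}

// Variant B: 1D grid, each block loops over chunks_per_block chunks.
// Assumed mapping: in iteration c, neighbouring blocks work on neighbouring chunks.
__global__ void copy_loop(const unsigned* in, unsigned* out, size_t n, int chunks_per_block) {
    for (int c = 0; c < chunks_per_block; ++c)
        copy_chunk(in, out, n, (size_t)c * gridDim.x + blockIdx.x);
}

// Host check: run both variants on the same input and compare against the source.
bool copies_match(int grid_x, int grid_y) {
    assert(grid_x > 0 && grid_y > 0);
    const size_t n = (size_t)grid_x * grid_y * kChunk;
    const size_t bytes = n * sizeof(unsigned);
    std::vector<unsigned> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (unsigned)i * 2654435761u;  // arbitrary pattern

    unsigned *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    bool ok = true;
    for (int variant = 0; variant < 2; ++variant) {
        cudaMemset(d_out, 0, bytes);  // clear the output so a failed launch is detected
        if (variant == 0) copy_grid2d<<<dim3(grid_x, grid_y), kThreads>>>(d_in, d_out, n);
        else              copy_loop<<<grid_x, kThreads>>>(d_in, d_out, n, grid_y);
        ok = ok && (cudaDeviceSynchronize() == cudaSuccess);
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
        ok = ok && (h_out == h_in);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling copies_match(960, 400) would reproduce the problem size from the post (960 x 400 x 64 threads x 4 elements = 98,304,000 elements); smaller arguments are enough to confirm that both variants copy every element.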
Looking at the generated PTX code, the loop looks pretty efficient (1 label, 3 adds, 1 compare, 1 predicated branch), so I am unsure what is killing performance in the looping version of the code. Based on the same article by Vasily, I tried copying data as uint4s and eventually achieved 157 GB/s, but I had to create an awkward launch configuration to get there: grid = 120x1, block = 512x1, each block processes 50 fixed-size chunks, and each chunk contains 16384 elements. In other words, each thread processes 8 uint4s (32 elements) per chunk of work.
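The uint4 configuration above can be sketched as follows. Again, this is an illustration under assumptions and not the attached kernel: the names copy_uint4_loop and uint4_copy_matches are hypothetical, the chunk-to-block mapping is assumed, and issuing all loads before any stores is one way to keep several memory requests in flight per thread, not necessarily what the original code does.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kVecPerThread = 8;  // 8 uint4s (32 elements) per thread per chunk, as in the post

// Hypothetical kernel, not the one from the attached test file.
// Each block loops over chunks_per_block chunks of blockDim.x * kVecPerThread uint4s.
__global__ void copy_uint4_loop(const uint4* in, uint4* out, size_t n_vec, int chunks_per_block) {
    const size_t vec_per_chunk = (size_t)blockDim.x * kVecPerThread;
    for (int c = 0; c < chunks_per_block; ++c) {
        // Assumed mapping: in iteration c, neighbouring blocks take neighbouring chunks.
        const size_t base = ((size_t)c * gridDim.x + blockIdx.x) * vec_per_chunk + threadIdx.x;
        uint4 v[kVecPerThread];
        // Issue all loads first so they can be outstanding at the same time.
        #pragma unroll
        for (int i = 0; i < kVecPerThread; ++i) {
            size_t idx = base + (size_t)i * blockDim.x;
            if (idx < n_vec) v[i] = in[idx];
        }
        // Then write everything back with 16-byte stores.
        #pragma unroll
        for (int i = 0; i < kVecPerThread; ++i) {
            size_t idx = base + (size_t)i * blockDim.x;
            if (idx < n_vec) out[idx] = v[i];
        }
    }
}

// Host check: copy a buffer sized for the given configuration and compare.
bool uint4_copy_matches(int grid, int block, int chunks_per_block) {
    assert(grid > 0 && block > 0 && chunks_per_block > 0);
    const size_t n_vec = (size_t)grid * block * chunks_per_block * kVecPerThread;
    const size_t n = n_vec * 4;                     // 32-bit elements
    const size_t bytes = n * sizeof(unsigned);
    std::vector<unsigned> h_in(n), h_out(n, 0u);
    for (size_t i = 0; i < n; ++i) h_in[i] = (unsigned)i * 2654435761u;  // arbitrary pattern

    uint4 *d_in = nullptr, *d_out = nullptr;        // cudaMalloc memory is aligned for uint4
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    copy_uint4_loop<<<grid, block>>>(d_in, d_out, n_vec, chunks_per_block);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    ok = ok && (h_out == h_in);

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

uint4_copy_matches(120, 512, 50) corresponds to the configuration in the post (120 x 50 x 16384 = 98,304,000 elements, roughly 393 MB per buffer); a smaller call is sufficient for a correctness check.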
Does anyone have any insight into whether “looping” really hurts performance this badly, or is there some other architectural issue going on? For instance, I noticed in the NVIDIA Compute Visual Profiler that the fast memcpy has about 9% global replays vs. 22% for the slower looping memcpy. The difference is about 13 percentage points, which is in the same range as the roughly 10% throughput gap I’m seeing (89/81 ≈ 1.10).
Memory constraints on the GTX 480 include an L1 cache (128 bytes at a time, 16 KB per SM), an L2 cache (8 bytes at a time, 768 KB capacity) with 6 memory controllers, and 1.5 GB of global memory. Meanwhile, there are 15 SMs with 2 warp schedulers each, each scheduler working on up to 16 warps to move data around.
Thanks in advance for your feedback.
P.S. I’ve included a standalone test file that contains the 3 different kernels (fast, loop, and uint4) and a CPU host wrapper for testing the I/O throughput of each kernel.
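In case the attachment is unavailable, a throughput measurement of this kind can be sketched with CUDA events. The helper names (effective_gbps, measure_d2d_gbps) are hypothetical, and the sketch assumes that the GB/s figures count bytes read plus bytes written with 1 GB = 1e9 bytes. It times a device-to-device cudaMemcpy as a baseline rather than the three kernels themselves.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a copy of bytes_copied bytes.
// Assumption: each copied byte is counted twice, once for the read and once for the write.
double effective_gbps(double bytes_copied, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return 2.0 * bytes_copied / 1e9 / (elapsed_ms / 1e3);
}

// Baseline measurement: time `reps` device-to-device cudaMemcpy calls of `bytes` each.
// Returns a negative value if allocation or timing fails.
double measure_d2d_gbps(size_t bytes, int reps) {
    void *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return -1.0; }
    cudaMemset(src, 1, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);  // warm-up, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed copies have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);

    if (ms <= 0.0f) return -1.0;
    return effective_gbps((double)bytes * reps, ms);
}
```

To time a custom kernel in the same way, the cudaMemcpy call inside the timed loop would be replaced by the kernel launch; averaging over several repetitions reduces the influence of launch overhead on the result.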