Fermi - Performance of offsetcopy kernel

Greetings,

Has anyone tested out the performance of offsetcopy kernel (refer to figure 3.7 in CUDA Best Practices Guide 3.0) for different offsets? I don’t see too much performance degradation for the offsetcopies for the Fermi cards (less than 5%). It would be helpful if I can corroborate my findings with others. Furthermore, what does it exactly mean when the memory transaction is cached (I assume this is the big change in Fermi that allows misaligned memory access to have trivial effect)?

Global memory access in fermi is completely different from accesses with a 1.x device.
First of all, the mem transaction size is 128 bytes regardless of the data type size. So an unaligned copy is twice as slow (2 aligned transactions needed) as an aligned copy on cache misses.
For this simple copy kernel, the cache will probably compensate for the unaligned accesses.

N.

Forgot to answer your cache question :)

Shared memory for fermi devices can be arranged to 16KB shared/48KB cache or vice versa.
The cache lines are used to store data for memory fetches, so if a thread requires data
that is already available in the cache (eg. has already been fetched earlier on by a misaligned copy in your example)
then it can read the data from the fast cache instead of requesting a new memory fetch.

N.

So the slight degradation in performance that I see is caused by the cache miss? Does the cache miss occur predominantly when the “offset” causes warps from different SMs to read and copy? And I assume that there would be more of a cache hit when these warps are located in the same SM, correct?

global void offsetCopy (float odata, float idata, int offset)

{

int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;

odata[xid] = idata[xid];

}

I’ve also wondered if this is the residual overhead of having to generate two memory transactions for a half-warp instead of one, due to the misalignment.

So you are implying that even if we can get the data from the cache instantaneously and never have cache misses, there will still be the overhead cost, which contributes to the delay in the case of misalignment?

That’s one possibility. Haven’t come up with a microbenchmark to figure out which is going on.

IMHO, I think what’s happening is this.

Suppose you’re accessing 4-byte words, so a warp accesses 128bytes (= fermi mem-transaction size)
Let’s take positive offset of x with 0 <x < 32

thread 0 -> offset 4*(x+0) bytes
thread 1 -> offset 4*(x+1) bytes

thread i -> offset 4*(x+i) bytes

The first warp will need to issue TWO mem-transfers
one from 0-127 bytes to accomodate threads 0…(32-x-1) -> to cache line 0
one from 128-255 bytes to accomodate threads 32-x-1…31 -> to cache line 1

The second warp:
threads 32…(64-x-1) read from cache line 1
ONE mem transfer from 256-383 bytes to accomodate threads 64-x-1…63 -> to cache line 2

The third warp:
threads 64…(96-x-1) read from cache line 2
ONE mem transfer from 384-511 bytes to accomodate threads 96-x-1…95

etc.

So I believe the small degradation is due to the fact that you need one extra 128 byte transfer (see warp 1) for the whole block compared to the aligned case.

N.

This makes sense. But I suppose one can say that there is an additional source of degradation due to extraneous cache reads. Moreover, in order for this type of a scheme to work, the scheduler has to make sure that warps access memory in timely manner (e.g. warp 1 first, warp 2 second, etc.) in order to increase the chance of cache hit, correct? So I assume this would be different from 1.x in which warps need not follow any particular sequence to access global memory, right? I wonder if this additional restriction would cause some other performance issue depending on the details of the kernel.

I think many of the cache reads can be hidden within the latencies of the actual memory fetches so that they do not contribute to the total execution time.

The warps do not need to follow a particular sequence as the cache is rather large. If you take my previous example and apply it to an out-of-order case for the warps, you will arrive at the same conclusion provided the total amount of memory fetched for the block fits inside the cache. The order in which the warps are executed will become important if it doesn’t fit within the cache in which case there’s a chance of a cache miss on data that has been fetched before but has been overwritten

N.

So basically in your example, if warp 2 went before warp 1 in time, then threads 32…(64-x-1) would read from global memory and this data will be moved to cache line 1, which can be directly read by warp 1. And generally, whether or not warp N accesses data directly from the global memory or from the cache will be determined based on whether warp N-1 and warp N+1 have been executed. Does the thread scheduler (or some other hardware) handle this info or is it based on the outcome of cache hit/miss?

That’s basically how a cache works. If a thread requires a memory access, it first checks whether or not the data is available in the cache. If it is not in the cache, a memory fetch is issued and the result is stored in an empty cache line and passed to the thread for computation. If there’s no empty cache line, the scheduler will use some kind of scheme to determine which cache line should be overwritten by the newly fetched data. In most cases it’s a Least Recently Used (LRU)-scheme in which the cache line that has been accessed least recently is replaced.

Checking for a cache hit should be pretty fast because it’ll happen on the SM-level without the need to access off-chip memory. I believe it’s clear from your benchmark that the fermi cards have a pretty nifty scheduler!

Please keep in mind that I’m not an nvidia employee so my explanation to the original problem remains pure speculation :)

N.