Global memory access cost

Is global memory access still 300 cycles? Someone very knowledgeable about CUDA quoted that number to me months ago, but I wonder if his metric was a bit dated. My understanding is that a basic arithmetic fp32 add or mult is about four cycles, but if global memory access is 300 cycles (let’s just assume it is one of those clean, coalesced reads, not a messier case where a warp has to read from multiple sectors to satisfy every thread’s data requirements), I feel like things don’t make as much sense. I’ve tried a couple of times, weeks of effort in each case, to move things around and minimize global memory access requirements. Reduced them by a factor of 10 in some cases. Doesn’t make nearly the dent in performance I was expecting. I think I recall on this list seeing a figure of 50 cycles for global memory access, and if that’s a more accurate metric for more recent architectures that is much more in line with the performance returns I have been seeing. Can anyone help set the record straight?

It varies from GPU to GPU. In general, latencies have gotten somewhat shorter over the span of CUDA’s existence.

It’s not obvious to me how GPU memory latency is connected with your statements. The GPU is intended to be a latency-hiding machine. If your efforts to improve GPU access patterns don’t seem to be paying off, one possibility is that your code is not bound by memory bandwidth. It may instead be bound by memory latency. In that case, the usual advice is to expose more work for the GPU to do (which will probably, in effect, drive up the memory bandwidth demand per unit time of your code).

A reduction in memory usage doesn’t necessarily help a latency-bound code in proportion to the reduction. Suppose memory latency is 10 cycles. Suppose I have to read 4 items to make my calculation:

C0  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15
R1  R2  R3  R4  XL  XL  XL  XL  XL  XL  XL  R1  R2  R3  R4  Answer

Rx = read request, or read response from memory
Cx = cycle
XL = latency (idle) cycle

In this silly example, the answer could be computed trivially in the 16th clock cycle, after all 4 read requests were satisfied from memory. Now let’s suppose I work really hard and reduce those 4 read requests to 2:

C0  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11
R1  R2  XL  XL  XL  XL  XL  XL  XL  R1  R2  Answer

Now the answer can be computed in the 12th cycle, after working really hard to reduce the memory load by 2x. What if you work really hard to reduce it by 2x again? You save only two more cycles (12 down to 10), roughly a 17% benefit. The solution is not to worry about what the actual latency is, in this case, but to load the GPU up with additional, useful work, so as to fill those empty latency cycles.

Of course I don’t know whether your code is latency bound or whether any of this matters, but the reasons why latencies aren’t published, and generally not that easy to find, are twofold:

  1. they are variable, even in the same GPU. Actual latency depends on what else is going on, and it certainly varies from GPU type to GPU type
  2. you’re really not supposed to have to worry about it as a GPU programmer

The latency is probably what I was missing in my analysis. I’ve also got 6-8 blocks per SM, which gives me ample opportunity to hide the read costs (let the scheduler do the work, as you say). The code I am replacing also got good mileage out of textures to cache things that I’m reading straight from global, so I am simplifying even if not making that much perf improvement. But my understanding of what goes on in those SMs is improving.

From a performance optimization perspective, it’s usually good to start with the profiler. Let the profiler tell you what the performance limiters are in your code.

If you google something like “gtc optimization” or “gtc analysis driven optimization” you will find lots of good material.

I am skeptical that global memory latencies have actually been declining on a per-cycle basis. For example, some sources claim that the higher latency of GDDR5X memory (compared with GDDR5) is having a negative impact on cryptocurrency mining performance when the miners run on NVIDIA GPUs.

But as txbob says, GPUs are built for high throughput, not low latency, so everything from pipeline length, to cache latency, to global memory latency tends to be higher than on CPUs, which have been optimized for low-latency operation over the past thirty years.

The main latency-hiding mechanism in GPUs is zero-overhead thread switching. So you would want lots of threads running concurrently, on the order of tens of thousands. You want to partition the threads into blocks small enough that multiple blocks can run concurrently on each SM, so that work is distributed with fairly fine granularity.

A good starting point is to target thread blocks of 128-256 threads, and to have at least 20 times as many thread blocks in a grid as are able to run concurrently, so that would be several hundred thread blocks as a minimum. Note that these are starting points. I am aware that some use cases work best with very small thread blocks, down to 32-thread blocks even. I don’t recall whether yours falls into that category.

The compiler can (and does!) also help hide latency by re-arranging the instructions such that long-latency instructions, in particular memory loads, get scheduled early in an instruction stream. Programmers can increase the “mobility” of loads in this process by using restricted pointers (the __restrict__ qualifier), which is a promise to the compiler that there is no aliasing. See the Best Practices Guide.

I assume you are already familiar with the importance of coalesced accesses to maximize effective memory bandwidth. As txbob says, the CUDA profiler allows you to effectively zero in on the bottlenecks in your code, and provides helpful metrics showing how efficiently memory bandwidth is being used.