Memory Wall, CUDA and NVIDIA: NVIDIA GPU hitting the Memory Wall?


I recently read an article from Sandia National Laboratories which states that more chip cores can mean slower supercomputing. They concluded this from simulations of some supercomputing applications. According to the study, the reason is the memory wall and contention for resources. But I have seen NVIDIA’s website, which shows speed-ups of 400x on a Tesla with 240 cores. Doesn’t that contradict the study? Does that mean CUDA will not hit the memory wall (at least in the near future)? If so, why?
I think NVIDIA people can help me understand this better.

That Sandia study was simulating “conventional” many-core CPUs working on large-dataset problems with an assumed shared memory bandwidth of 10GB/s. The Tesla C1060 has a memory bandwidth of 102GB/s…

After skimming through the article, the conclusion seems to be “you can’t just add cores, you need to make memory faster as well”. That’s not really groundbreaking, but it is true.

GPUs have higher memory bandwidth (sometimes by an order of magnitude, as avidday mentioned), but you only get it with proper access patterns, and many kernels are bandwidth bound. So it’s not that CUDA is above and beyond the memory wall; they’ve just pushed it a bit further with smart engineering and a specialized architecture.

But NVIDIA will have to continuously push this wall further with the SMART ENGINEERING you referred to. The question is whether NVIDIA can guarantee that software written for present GPUs will also work on future GPUs, given this quest to push the memory wall further out.

Can Intel promise you that? How sure are you that you won’t have to learn Ct instead of regular C++ and pthreads?

No one can promise you anything… I guess NVIDIA and Intel are here to stay for the next few years (at least :) ), and their products will only get better with time. Currently I don’t think Intel/ATI can even compete with NVIDIA performance-wise.

Are you able to get 20x, 50x or 100x speedups on any other hardware? The answer is probably no, and therefore even if it only lasts 5-10 years, I guess it’s worth it: you need the speedup now.

I used to write a lot of code in Delphi. It was the greatest product ever, and now it’s dead :( MS killed it. No one promised Delphi would live forever either…


The memory wall is more of a problem for single-core architectures than for multicore architectures. For single-core architectures, frequency scaling meant that memory latency mattered more than bandwidth, because a cache miss stalled the CPU for a number of cycles that grew as you scaled the CPU frequency.

For multicore processors and GPUs, you can just add more memory banks and memory controllers and easily increase the memory bandwidth of the system. You couldn’t do this for single-core processors because it would not improve memory latency. The main problem for multicore is pin bandwidth. Although we have not hit a limit yet, the number of I/O pins on a processor is more or less fixed, and you have to move all of your data (~150GB/s on high-end NVIDIA GPUs) across those pins.
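
To put a number on the pin-bandwidth point: peak DRAM bandwidth is roughly the bus width times the per-pin transfer rate. The Python sketch below is only an illustration. The function name is made up, and the 512-bit bus with 800 MHz double-data-rate memory is an assumption based on commonly quoted Tesla C1060 specs, not something stated in this thread.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, transfers_per_s_per_pin):
    """Peak bandwidth in GB/s: data pins * transfers per second per pin / 8 bits per byte."""
    return bus_width_bits * transfers_per_s_per_pin / 8 / 1e9

# Assumed C1060-like figures: 512 data pins, 800 MHz DDR => 1.6e9 transfers/s per pin.
c1060_like = peak_bandwidth_gb_per_s(512, 1.6e9)
print(c1060_like)

# Doubling the per-pin rate with the same pin count doubles the peak figure.
faster_pins = peak_bandwidth_gb_per_s(512, 3.2e9)
print(faster_pins)
```

With the pin count more or less fixed, the per-pin rate is the main lever left in this simple model.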

The memory wall as such is the problem of getting data from off-chip memory (DRAM) to the processing cores.

On CPUs, AMD had the lead for the last two years because they put the memory controller (the bit of hardware that actually fetches data) on the die, which makes multi-socket systems NUMA: each socket has full memory bandwidth as long as its cores access “local” data, i.e. data attached to that socket. Intel stuck to the northbridge/southbridge architecture. With Core i7, Intel took the lead again by migrating to NUMA, at least in my microbenchmarks. The relevant point is that any NUMA architecture scales memory performance with the number of sockets.

The interesting observation is that GPUs behave similarly: replace “socket” with “memory partition”. There is a wicked crossbar switch, so GPUs are not hardware-NUMA, but the effect on performance is the same. Full bandwidth (160+ GB/s) is only available if all “cores”, warps or multiprocessors access different memory banks in different memory partitions. Check the transpose sample in recent tutorials (probably also in the SDK, I didn’t check) for “partition camping”: if all memory accesses hit the same memory partition, your memory performance plummets.
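
A toy model may make partition camping easier to picture. The Python sketch below is not how the hardware is documented to work in this thread: the 8 partitions, the 256-byte interleave, and all names are assumptions chosen for illustration. It only counts which partition a set of concurrent accesses would land in.

```python
from collections import Counter

PARTITION_WIDTH_BYTES = 256  # assumed interleave granularity
NUM_PARTITIONS = 8           # assumed number of memory partitions

def partition_of(byte_address):
    """Toy mapping: consecutive 256-byte chunks rotate through the partitions."""
    return (byte_address // PARTITION_WIDTH_BYTES) % NUM_PARTITIONS

def partitions_hit(byte_addresses):
    """Count how many of the given accesses land in each partition."""
    return Counter(partition_of(a) for a in byte_addresses)

ROW_BYTES = 2048 * 4  # one row of a 2048-column matrix of 4-byte floats

# 32 concurrent accesses walking along one row, one per 256-byte chunk.
along_row = [k * PARTITION_WIDTH_BYTES for k in range(32)]
# 32 concurrent accesses walking down one column, one per row.
down_column = [k * ROW_BYTES for k in range(32)]

row_hits = partitions_hit(along_row)
column_hits = partitions_hit(down_column)
print(len(row_hits), max(row_hits.values()))
print(len(column_hits), max(column_hits.values()))
```

In this model the row walk spreads evenly over all eight partitions, while the column walk piles every access onto one partition because the row size is a multiple of the full interleave span.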

In short: GPUs suffer from the memory wall in just the same way as CPUs, only at a much higher level.

transposeNew is the sample you are referring to.

As far as I am aware, the term “memory wall” was first defined in this paper:

Wulf, W. A. and McKee, S. A. 1995. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News 23, 1 (Mar. 1995), 20-24…Wall-wulf94.pdf

At least with NUMA architectures, you have a chance to address the problem at the algorithmic level. From the way that the memory wall was originally defined:

With single-threaded, single-core CPUs, there is nothing that you can do at the algorithmic level. Your program will have some finite number of cache misses, and Amdahl’s law says that if the time it takes to service these misses becomes significantly larger than the time it takes to process an instruction, servicing cache misses will dominate your application’s performance.
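
The argument can be put in numbers with the usual average-access-time model: hit rate times cache time plus miss rate times miss time. The Python sketch below uses made-up figures (a 99% hit rate, a 1-cycle cache, and three miss penalties) and hypothetical function names, purely to show how the miss term takes over as the miss penalty grows.

```python
def average_access_time(hit_rate, cache_cycles, miss_cycles):
    """Average cycles per memory access for a given cache hit rate."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * miss_cycles

def miss_share(hit_rate, cache_cycles, miss_cycles):
    """Fraction of the average access time that is spent servicing misses."""
    total = average_access_time(hit_rate, cache_cycles, miss_cycles)
    return (1.0 - hit_rate) * miss_cycles / total

# Illustrative only: the hit rate stays fixed while the miss penalty
# (measured in CPU cycles) grows, as it does when the clock scales faster than DRAM.
for penalty in (10, 100, 1000):
    print(penalty,
          round(average_access_time(0.99, 1, penalty), 2),
          round(miss_share(0.99, 1, penalty), 2))
```

Even with 99 out of 100 accesses hitting the cache, a large enough miss penalty makes the misses account for most of the time in this model.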

In your transpose example, you can assign different CTAs to different memory partitions, and as long as your program has this property, the performance can be improved by adding more GPU cores and more memory partitions.

I think that it is valuable to remember this point from the paper:

In their case the unstated assumption was that programs were single-threaded and that there was a single CPU core and memory partition. Even if writing applications is more complex because the programmer has to take into account which GPU core accesses which memory partition, NUMA memories and multicore CPUs/GPUs seem to offer a way over the memory wall until we hit the next “wall”.

Gregory: Can you tell me exactly what you mean when you say "At least with NUMA architectures, you have a chance to address the problem at the algorithmic level"?

I mean that you can write scalable code such that the memory accesses are distributed across many different DRAM banks, similar to avoiding shared memory bank conflicts. For example, if you have a kernel that accesses an array of size N, you can chop the array up into M pieces of size N/M. If you assign CTA i to process elements [N*i/M, N*(i+1)/M), the DRAM mapping stride is N/M, and you have one CTA per DRAM module, then each CTA will access data from only one DRAM module. A program written like this will be scalable with the number of DRAM modules and GPU cores.
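
The index arithmetic in that example can be sketched as follows. This is a host-side Python model, not a CUDA kernel. The function names are hypothetical, and the address-to-module mapping (a new DRAM module every N/M elements) is exactly the assumption stated above rather than a property of any particular GPU. It also assumes N is divisible by M.

```python
def cta_range(i, n, m):
    """Half-open element range [n*i/m, n*(i+1)/m) handled by CTA i."""
    return (n * i) // m, (n * (i + 1)) // m

def dram_module_of(element, n, m):
    """Assumed mapping: a new DRAM module every n/m elements (stride n/m)."""
    return element // (n // m)

N, M = 1 << 20, 8  # array size and number of DRAM modules (one CTA per module)

modules_per_cta = []
for i in range(M):
    start, stop = cta_range(i, N, M)
    # The mapping is monotonic, so checking the first and last element is enough.
    modules_per_cta.append({dram_module_of(start, N, M),
                            dram_module_of(stop - 1, N, M)})

print(modules_per_cta)
```

Each CTA touches exactly one module in this model, so adding modules (and CTAs) adds bandwidth rather than contention.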

Conversely, as Dominik mentioned, if all of the threads in your program access memory that is mapped to a single DRAM module, your bandwidth will be limited to that of a single DRAM module.