I recently read an article here from Sandia National Laboratories, which states that more chip cores can mean slower supercomputing. They concluded this from simulations of several supercomputing applications. The reason, per the study, is the memory wall and contention for resources. But NVIDIA’s website shows a 400× speed-up on Tesla with 240 cores. Doesn’t that contradict the study? Does that mean CUDA will not hit the memory wall (at least in the near future)? If so, why?
I think NVIDIA people can help me understand this better.
After skimming through the article, the conclusion seems to be “you can’t just add cores, you need to make memory faster as well”. That’s not really groundbreaking, but it’s true.
GPUs have higher memory bandwidth (sometimes by an order of magnitude as avidday mentioned) but it only works with proper access patterns and many kernels are bandwidth bound. So it’s not that CUDA is above and beyond the memory wall, they’ve just pushed it a bit further with smart engineering and a specialized architecture.
But NVIDIA will have to continuously push this wall further through the smart engineering you referred to. The question is: can NVIDIA guarantee that software written for the present GPUs will also work on future GPUs, given this quest to push the memory wall farther?
The memory wall is more of a problem for single core architectures than multicore architectures. For single core architectures, frequency scaling meant that memory latency was more important than bandwidth because a cache miss would mean the CPU stalling for a number of cycles that increased as you scaled the CPU frequency.
For multicore processors and GPUs, you can just add more memory banks and memory controllers and easily increase the memory bandwidth of the system. You couldn’t do this for a single core because it would not improve memory latency. The main problem for multicore is pin bandwidth: although we have not hit a limit yet, the number of I/O pins on a processor is more or less fixed, and you have to move all of your data (~150 GB/s on high-end NVIDIA GPUs) across those pins.
The memory wall as such is the problem to get data from off-chip memory (DRAM) to the processing cores.
On CPUs, AMD had the lead for the last two years because they included the memory controller (the bit of hardware that actually fetches data) on the die, giving a NUMA design: each socket in a multi-CPU system has full memory bandwidth as long as each core in a socket accesses “local” data, i.e. data attached to its own socket. Intel stuck with the north/southbridge architecture. With Core i7, Intel took the lead again by migrating to NUMA, at least in my microbenchmarks. The relevant point is that any NUMA architecture scales memory performance per socket.
The interesting observation is that GPUs behave similarly: replace “socket” with “memory partition”. There is a wicked crossbar switch, so GPUs are not hardware-NUMA, but the effect on performance is the same. Full bandwidth (of 160+ GB/s) is only available if all “cores”, warps, or multiprocessors access different memory banks in different memory partitions. Check the transpose sample in recent tutorials (probably also in the SDK, didn’t check) for “partition camping”: if all memory accesses hit the same memory partition, your memory performance plummets.
In short: GPUs suffer from the memory wall in just the same way as CPUs, just at a much higher level.
At least with NUMA architectures, you have a chance to address the problem at the algorithmic level. Consider the way the memory wall was originally defined:
With single-threaded, single-core CPUs, there is nothing you can do at the algorithmic level. Your program will incur some finite number of cache misses, and by an Amdahl’s-law-style argument, if the time it takes to service those misses becomes significantly larger than the time it takes to process an instruction, servicing cache misses will dominate your application’s performance.
In your transpose example, you can assign different CTAs to different memory partitions, and as long as your program has this property, the performance can be improved by adding more GPU cores and more memory partitions.
I think that it is valuable to remember this point from the paper:
In their case the unstated assumption was that programs were single threaded and that there was a single cpu core and memory partition. Even if writing applications is more complex because the programmer has to take into account which GPU core accesses which memory partition, NUMA memories and multicore CPUs/GPUs seem to offer a way over the memory wall until we hit the next “wall”.
I mean that you can write scalable code such that memory accesses are distributed across many different DRAM banks, similar to avoiding shared-memory bank conflicts. For example, if you have a kernel that accesses an array of size N, you can chop the array up into M pieces of size N/M. If you assign CTA i to process elements [N*i/M, N*(i+1)/M), the DRAM mapping stride is N/M, and you have one CTA per DRAM module, then each CTA will access data from only one DRAM module. A program written like this will scale with the number of DRAM modules and GPU cores.
Conversely, as Dominik mentioned, if all of the threads in your program access memory that is mapped to a single DRAM module, your bandwidth will be limited to that of a single DRAM module.