ROP and GPGPU


The render output unit - does it play any role if you use the card only for GPGPU?

I refer to the difference between the Titan X and the GTX 1080 Ti with 96 versus 88 ROPs.

In short: is it the number of cache lines that can be written simultaneously to the GDDR5, or does it only apply to graphics and not to GPGPU (or something else)? In the first case, how many bytes does each ROP hold prior to writing?

The CROP/ZROP units are not used by the compute engine.

The 1080 Ti is missing 1 of the 12 L2 controllers, which has the following clock-for-clock impact between a Titan X and a GTX 1080 Ti:

  • 1/12 less L2 bandwidth
  • 1/12 less L2 capacity
  • 1/12 less global atomic performance
  • 1/12 less bandwidth to DRAM

The L2 and memory clocks differ between the two GPUs, so the actual performance difference may be more or less than what is listed above.

Thanks Greg, very good that you mention that one, as the L2 performance is very important to understand well.

How many different locations can be kept open simultaneously to the GDDR5 RAM? In the current paper model here, a Titan X would run 8 warps simultaneously on each SM, meaning each SM writes to 8 different GDDR5 locations simultaneously. Times 28 SMs that means 28 * 8 = 224 streams, each writing to a different location. Of course the 32 CUDA cores in each warp write coalesced. Would that get the full bandwidth from the GDDR5?

In tests on the GTX 980 I didn't see a big improvement moving from 8 to more simultaneous warps per SM - but that was WITHOUT streaming to the GDDR5.

I’ll be writing a benchmark for sure. When I wrote some benchmarks for supercomputers years ago that was a big shock - it appeared back then (2002-2003) that they had been optimizing for some silly tests executed from a single node instead of real-world performance from all cores at the same time :)

I wonder what the expectation there is for the different GPU’s and also what size buffer to stream to maximize bandwidth :)

You cannot get the full bandwidth of GDDR5 GPU memory, just as you cannot get the full bandwidth of DDR4 system memory in a benchmark. Expect to max out at around 80% of the theoretical bandwidth. The rules for maximum bandwidth are basically: (1) all accesses coalesced, (2) each thread makes 128-bit accesses (best use of the limited-depth load/store queue). The simple kernel below will do that (configure to taste, e.g. blocks = 65520, threads/block = 128, len = 100000000).

__global__ void zcopy (const double2 * __restrict__ src, double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}

Note that various performance issues have been reported with GDDR5X memory (search this forum for details).

By ‘full bandwidth’ I do not mean the quoted bandwidth, which of course includes all sorts of CRCs and non-user data bits - but rather what is achievable by 28x8 warps each writing to a different memory location and streaming to that location, as compared to the above test :)

It is very good that you mention the 128 bits per thread as a good starting point for the test to write :)

I’ll also experiment with larger chunks of memory of course, up to several kilobytes, prior to switching to a different write address.

Yet writing to 28x8 different memory locations at the same time is the tough part of course. On a CPU that would kill performance.

Total working set size a few gigabyte.

By the way, refresh my memory - ‘restrict’ in the function you posted - that doesn’t mean it goes through the read-only cache, does it? (on Kepler GPUs)

You also write through a ‘restrict’ pointer. What does ‘restrict’ mean when you write through it?

With some GPUs (no idea about the GTX 1080 Ti) you need many thread blocks to achieve full bandwidth. In other words, if 28x8 warps is all you are running on the entire processor, it is not likely you will see anywhere close to full bandwidth. Rule of thumb: shoot for total blocks = 20 x (thread blocks running concurrently on the GPU). That’s probably at least hundreds of blocks in the grid.

Sounds like a bad idea. GPU memories are optimized for streaming, not random access. If you jump all over memory, you may wind up with very poor throughput.

Suggestion: Better stop with paper designs and start with experiments.

‘restrict’ is equivalent to C99 ‘restrict’. For some reason the C++ folks refuse to add this qualifier to the ISO C++ standard, which is why it has to be implemented as a toolchain-specific feature.

This is a promise the programmer gives to the compiler that access to the underlying data object is restricted to this pointer. In other words, a promise that there is no aliasing. If the programmer lies to the compiler, the resulting code may not work as desired. Restricted pointers often allow the compiler to re-order loads and stores more aggressively. Recommended.

See also the Best Practices Guide.

Yes, I still noticed on Maxwell (GTX 980) a great speedup when moving to 8 concurrently running warps per SM.

For an FFT/NTT implementation crunching million-bit prime numbers, I will obviously first try to do without each warp requiring a couple of kilobytes of the L1 data cache. In the magnificent case that I get that to work, it will be possible to throw more warps concurrently at the same SM (potentially concurrently, of course - I assume the SM doesn’t actually execute that much concurrently while waiting for GDDR5 data to flow in).

Yet there are hard limits, I mean real hard limits, on the number of warps I can launch. At the size of prime numbers I’m testing, I’m looking at a couple of megabytes of GDDR5 storage needed for each transform.

224 x 4 MB is already approaching a gigabyte, and the latency of GDDR5 (and all DDR memory) is of course much better when the total chunk of RAM you touch is smaller.

So we can already prove pretty easily that 20x that concurrency is not going to be ideal for this, because the slowdown will be too much.

In other words, if all 8x28 warps are working on 100 MB of GDDR5 in total, the total bandwidth you get out of that is always considerably more than you can get out of 1 GB of RAM, provided you jump semi-randomly through it (which is the case with an FFT).

Because the jumps are a lot smaller. That’s simply how the RAM works at the hardware level.

Note that 4 MB would equal roughly a 2^22-bit = 4+ million bit number, which is pretty small actually if you realize that all the Xeon CPU cores I have here are testing around 5 million bits as we speak :)

So the total amount of GPU RAM versus the maximum transform size is going to be an issue already - and GPUs in general are not so happy if you eat up all the RAM they have onboard.

Let’s not talk about OpenCL, which then swaps you out to the slow CPU RAM :)

Yeah, thanks for the restrict refresher. I tend not to use all those keywords in most code I write.

Definitely must use it here for the GPU!

What do you mean by “not so happy”? Windows may prevent you from using close to all of the RAM on the board, and there was one case of pathological behavior (on GTX 970, I think?) where the last 10% or so of the RAM went through a slower access path, but other than that I am not aware of any issues with using all the memory you can possibly allocate.

If you compare GPU and CPU for use cases that use a relatively small amount of memory, keep in mind that CPUs come with very large caches these days that provide low-latency and high-throughput memory, and if the use case is such that all processing can be done within the CPU caches, that use case might be a better match for the CPU (provided you use multi-threading and SIMD vectorization) than the GPU.

In general CPUs and GPUs complement each other nicely in that they are optimized for different use cases, and the programmer’s task is to find out which cases should run where. There is no point in running around with a hammer (the GPU) and treating every use case as a nail, even if it looks like a screw.

Ah i see why in the sieving kernels i didn’t need the restrict keyword at all.

I already restricted the number of function calls :)

Who needs function calls anyway if you can win another clock with it?

By ‘not so happy’ I refer to my experiments a few years ago - it has been quite some years since I toyed with larger RAM chunks - when I tried to allocate a gigabyte or so and then have all threads work on the same RAM.

All threads launched at the entire GPU yeah :)

All this is not so relevant now - I’m going to write benchmarks to see how I can optimally use the GDDR5 for the transforms, vary all the parameters, and make nice graphs out of it.

The intention is not to make something to compare different GPUs or manufacturers with each other. I just want to get the maximum out of the GPU that I humanly can, with my limited time for the project :)

In politics and government commissions I bet the phrase ‘GPU and CPU complementing each other’ is a nice find.

In reality the GPU is much faster and cheaper than the CPU per double-precision gflop, so you figure out a way to get it to work somehow - or you go find another job.

It’s much tougher to get things to work on a GPU of course, as it is less generic than a CPU. It’s more dedicated to what it can do, with a very limited instruction set and very, very little cache.

I would argue a manycore in no way complements a CPU. It replaces it where the CPU is too expensive to deliver big performance.

p.s. Besides that, I need a compact heater that’s not too loud, now that it’s slowly getting colder towards winter :)

No need to give a pep talk on the virtues of GPUs. I was among the first five people who worked on CUDA, worked on software for the CUDA eco-system for nine years, and helped some customers optimize their CUDA applications along the way. So when I say that there are some use cases that are better handled by CPUs, I am not making stuff up. With the progress made in both GPU architectures and specific implementations, there are fewer such use cases every year.

In my experience, to make the best use of a high-end GPU, you want a fast CPU to deal with the serial work that remains once the parallelizable portion has been accelerated with a GPU. My current recommendation is > 3.5 GHz base clock (so high single-thread performance) and ~4 CPU cores per GPU (assuming the important applications running on the machine are GPU accelerated; otherwise you may want more cores). And for every GB of GPU memory you would ideally get 4 GB of system memory, with as many channels of DDR4-2400 as you can afford. If you look at NVIDIA’s DGX-1, it’s built along similar lines (not that I had any input into the configuration of that machine).

CUDA programming is about balancing resources (warps, registers, shared memory). At 2 warps per SM scheduler the kernel is at the maximum registers per thread (255). This sacrifices latency hiding for registers. If the warps do not need 255 registers per thread, then I would recommend increasing the number of warps per SM to try to better hide latency.

On a GP102 there are 24 L2 slices and 12 memory controllers. In order to fully saturate L2 slice bandwidth, each slice requires 1-2 32-byte reads and 1 32-byte write per cycle, and each memory controller requires a 32-byte read or write (based on the memory clock).

Interesting post Greg - you brought me onto an idea. On a side note, as I see it a lot of the code has huge ILP - that is, apart from the loads from GDDR5. Yet I really do not see how to hide the GDDR5 latency other than launching at least 8 warps (of 32 CUDA cores) onto each SM.

There aren’t prefetch opcodes I can give the GPU hints with, are there? So that it is already preloading GDDR5 data while it’s busy executing double-precision (or integer-based NTT) instructions.

Using about 8+ registers per CUDA core in each warp of 32 CUDA cores reduces the load on the L1 data cache considerably - by a factor of 3. Yet those are 64-bit doubles, so I assume that translates to 16 registers. On top of that there is a need for loop variables. The entire Fourier transform I do iteratively, of course.

Some of the loop variables I simply cannot get rid of. Those are integers, obviously. Maybe I can get away with 20 32-bit registers then.

So there would be potential for 12 warps on each SM. Note that 8 warps scheduled on each SM will eat up 32 KB of L1 data cache or so. I might want to schedule 2 different kernels on each SM in addition, doing other work. That is - if I still have some L1 data cache left :)

We’ll see from the tests what is best on each GPU architecture. That will eat up months for sure, as I can’t work full-time on this. As you might notice, it’s 2:43 AM here already. In the end the story is about how much bandwidth I can effectively use from the GDDR5 to stream data to and from the SMs, given the semi-‘random’ addresses one has to stream from. Once some indications are there, it’s easier to know what sort of speed the transform can get in the manner I’m going to implement it…