CUDA compiler bug or user error?

For a reduce (sum) operation, does it make sense that the technique could achieve an 8x improvement due to hiding memory latency?

To give useful feedback I would have to actually read the relevant section of the book, which I am frankly not prepared to do at this time (I don’t own a copy, for starters).

The “fewer threads can be better” strategy is best explained in Vasily Volkov’s original presentation, “Better Performance at Lower Occupancy”. It is ancient but still relevant.

This is arguably the most useful and impactful presentation ever given at any GTC.
It’s a subtle, high-level programming technique… not for beginners, but not as deep as, say, a master ninja writing their own SASS assembler to precisely optimize inner-loop instructions.
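To make the idea concrete, here is a minimal sketch (my own illustration, not code from Volkov’s slides) of the pattern applied to a sum reduction: each thread carries several independent accumulators, so several loads and adds are in flight per thread (instruction-level parallelism) instead of relying on high occupancy alone to hide memory latency. Kernel and variable names are made up for the example.

```cuda
// Sum reduction with 4 independent accumulators per thread (ILP).
// Launch with a modest grid so each thread processes many elements.
__global__ void reduce_ilp(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;

    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = tid;
    // Main loop: four independent load/add chains per iteration, so the
    // hardware can overlap their memory latencies. Consecutive threads
    // still read consecutive addresses, so accesses stay coalesced.
    for (; i + 3 * nthreads < n; i += 4 * nthreads) {
        s0 += in[i];
        s1 += in[i + nthreads];
        s2 += in[i + 2 * nthreads];
        s3 += in[i + 3 * nthreads];
    }
    // Tail: mop up the remaining elements one stride at a time.
    for (; i < n; i += nthreads)
        s0 += in[i];

    float s = (s0 + s1) + (s2 + s3);

    // Standard shared-memory block reduction, one atomic per block.
    extern __shared__ float smem[];
    smem[threadIdx.x] = s;
    __syncthreads();
    for (int ofs = blockDim.x / 2; ofs > 0; ofs >>= 1) {
        if (threadIdx.x < ofs)
            smem[threadIdx.x] += smem[threadIdx.x + ofs];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(out, smem[0]);
}
```

The point is the four independent accumulators: because s0…s3 have no dependencies on each other, the compiler and hardware can keep four memory requests outstanding per thread, which is exactly how fewer threads can still saturate bandwidth.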

Weird. I thought I had already pointed to Volkov’s paper. Must have been in a different thread …

No worries, this has been a very helpful discussion :)

@SPWorlwy, thanks for the paper, I’ll definitely check it out. It sounds like the right level for what I’m looking for right now.

If you ever feel like doing a technical deep dive, consider reading Volkov’s PhD thesis:

Vasily Volkov, “Understanding Latency Hiding on GPUs”. PhD Thesis, UC Berkeley, August 2016

Looks great, I’ll check it out too. The slides were very illuminating. The book I mentioned attempts to explain those concepts but does a poor job of it.

Out of curiosity, how close are the cuBLAS functions to theoretical peak performance? Can one beat them with plain CUDA using Volkov’s approach? I profiled a GEMM once and noticed it was only achieving 25% occupancy on my GTX 980 Ti (Maxwell), but now I understand why that’s probably better than 100%.

I am also curious if knowing the problem dimensions at compile time opens up more optimization possibilities.
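On the compile-time question: in general, yes. When the dimensions (or tile sizes) are template parameters, the compiler knows every trip count, can fully unroll inner loops, keep per-thread tiles entirely in registers, and drop bounds checks. A hypothetical sketch under those assumptions (not how cuBLAS is actually written):

```cuda
// TILE is known at compile time, so #pragma unroll can fully unroll the
// loops and the per-thread row fragment lives entirely in registers.
template <int TILE>
__global__ void scale_rows(float* m, const float* scale, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float r[TILE];                   // register-resident row fragment
    #pragma unroll                   // trip count is a constant -> unrolled
    for (int j = 0; j < TILE; ++j)
        r[j] = m[row * TILE + j] * scale[row];

    #pragma unroll
    for (int j = 0; j < TILE; ++j)
        m[row * TILE + j] = r[j];
}

// Instantiate only the sizes you actually need, e.g.:
//   scale_rows<32><<<grid, block>>>(d_m, d_scale, rows);
```

The downside is combinatorial: one compiled kernel per size (or per tile-size bucket), which is part of why libraries like cuBLAS ship many specialized variants and dispatch among them at runtime.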

Volkov’s presentation is from 2010, four generations of hardware ago. NVIDIA’s modern cuBLAS uses similar interleaved parallel-access techniques plus carefully crafted and tuned kernels, and performs above 90% of theoretical hardware FLOPS.

But… if you’re a master ninja and write your own SASS assembler, you can also write Maxwell/Pascal SGEMM code that reaches 98% of theoretical hardware FLOPS.

The important (for some definition of “important” :-) CUBLAS functions are written in handcrafted native assembly code (not PTX, which is just a portable virtual ISA that is compiled to SASS by the ptxas component of the CUDA compiler).

Generally speaking, third-party programmers will not be able to match the performance of these highly optimized implementations with compiled code. There are people who have reverse engineered the machine instruction set of various NVIDIA GPUs (I am aware of such attempts for Fermi, Kepler, and Maxwell) and written their own rudimentary assemblers for programming at the SASS level. With that, plus lots of domain knowledge and programming skill, a programmer could craft superior code.

However, something as simple as a matrix multiplication has many specific sub-cases (think squarish vs tall-skinny matrices, for example). So when you call *GEMM, you are calling one of dozens of different implementations depending on which category this particular matrix multiply falls into, what GPU architecture you are using etc. On occasion people have been able to beat CUBLAS for a specific set of subcases that did not yet enjoy the benefit of a specialized implementation inside CUBLAS.

Historically, and in particular when CUBLAS was still ordinary compiled CUDA code, Mr. Volkov provided important guidance that was incorporated into CUBLAS. CUBLAS may even contain actual code written by him; check the list of credits in the CUDA docs, which lists all third-party open source code incorporated by CUDA along with its licenses (as required by these OSI-approved licenses, typically variants of BSD or MIT).

6. Some of the cuBLAS library routines were written by or derived from code written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows […]