Low Level CUDA C Programming Education

I am having a bit of a challenge finding clear, marketing-free documentation of the lowest-level C and assembly programming of CUDA. My application is not amenable to common libraries; it consists of a massive number of MAC (multiply-accumulate) operations. We have a naïve CUDA implementation, but I need to find information on how to optimize our approach: currently we are severely memory-access bound. I just can’t get past the documentation’s marketing-laden jargon to find clear, declarative engineering descriptions of the architecture, inline assembly, and how to optimize/profile at the lowest level. Any pointers to clearer documentation and tutorials would be greatly appreciated.
I’m so frustrated with the docs and current performance that I’m ready to toss GPUs and just drop in an FPGA. FWIW, Intel’s docs are even more deeply saturated with marketoid speak.

For general introductory CUDA education, I recommend this series. It is certainly intended to be “clear and devoid of marketoid speak”. However, if you feel that terminology like “coalesced” or “streaming multiprocessor” is “marketoid speak”, then I don’t know what to suggest.

CUDA assembly code is SASS. There is no toolchain support for writing SASS directly, and no viable methodology for CUDA programming in which you provide SASS as your source code.

The next level above that is PTX. I personally would not recommend it, but it is documented here. You can certainly do all your programming in PTX if you wish, and you can also use PTX as “inline assembly”, which is useful in certain situations. PTX is the assembly language of an abstract machine model; in actual use it is compiled to SASS for the target GPU.
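As a minimal sketch of the inline-assembly route, here is a device function that reads the `%laneid` special register using the documented inline-PTX constraint syntax (the function and kernel names are my own, chosen for illustration):

```cuda
#include <cstdio>

// Reads the warp lane index via a PTX special register.
// "=r" binds a 32-bit PTX register output to the C++ variable.
__device__ unsigned int lane_id() {
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

__global__ void show_lanes() {
    printf("thread %d -> lane %u\n", threadIdx.x, lane_id());
}
```

This kind of narrow, single-instruction use is where inline PTX tends to pay off; writing whole kernels this way rarely beats the compiler.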

Above that you have CUDA C++.

The modern CUDA profilers are Nsight Compute and Nsight Systems. You can get some introductory treatment here for Nsight Compute and here for Nsight Systems. For the things you seem to be interested in, Nsight Compute is the proper profiler. For a deeper look at typical usage of Nsight Compute in a profiler-driven optimization process, see here.

Again, if all of that is “saturated with marketoid speak” then I apologize, I have no other suggestions.

This strongly suggests that low-level programming is not what is going to help improve performance for this use case. What will help is using the profiler to extract memory-specific metrics, followed by work at the level of algorithms and data structures. Since it is common for CUDA code to be limited by effective memory throughput (on the hardware side, it is easier to boost FLOPS than to increase RAM bandwidth), this topic is covered fairly extensively in NVIDIA’s docs and supporting material.
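To illustrate the kind of access-pattern issue those memory metrics will surface, here is a hedged sketch (kernel names are hypothetical) contrasting a coalesced MAC kernel with a strided one:

```cuda
// Coalesced: adjacent threads in a warp read adjacent floats, so each
// warp's loads are serviced by a minimal number of memory transactions.
__global__ void mac_coalesced(float* __restrict__ acc,
                              const float* __restrict__ a,
                              const float* __restrict__ b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acc[i] += a[i] * b[i];
}

// Strided: adjacent threads read elements 32 floats apart, so the same
// warp scatters across many cache lines and wastes most of the bandwidth
// of every line it touches.
__global__ void mac_strided(float* __restrict__ acc,
                            const float* __restrict__ a,
                            const float* __restrict__ b, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) acc[i] += a[i] * b[i];
}
```

Reordering data structures so that the hot loop looks like the first kernel rather than the second is exactly the algorithm/data-structure work referred to above.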

In situations where performance is bound by instruction throughput, I would suggest starting by studying the SASS code produced by the toolchain (e.g. from cuobjdump --dump-sass). Fair warning: newer architectures use a lot of somewhat unusual instructions that make optimal use of a three-input datapath but hinder code comprehension by human readers. For example, you may find that all logical operations are mapped to LOP3 instructions via a many-to-one mapping. The compiler can also transform code so massively that it becomes difficult to match specific SASS instructions with the corresponding HLL source code.
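A small sketch of the LOP3 point (file name and `sm_80` target are arbitrary examples):

```cuda
// Minimal kernel for studying SASS output. Compile and dump, e.g.:
//   nvcc -arch=sm_80 -cubin lop.cu -o lop.cubin
//   cuobjdump --dump-sass lop.cubin
__global__ void lop(unsigned int* d, unsigned int a,
                    unsigned int b, unsigned int c) {
    // Three distinct logical operators in the source; the compiler
    // typically collapses the whole expression into a single LOP3.LUT
    // instruction, with the operation encoded as an 8-bit truth table.
    *d = (a & b) ^ c;
}
```

Recognizing that one LOP3 can stand in for several source-level operators makes the disassembly considerably less confusing.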

The good news is that when it comes to floating-point operations, the few instructions available are largely self-explanatory.
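For a MAC-heavy workload, the floating-point instruction to look for in the SASS is FFMA. A hedged sketch (kernel name is hypothetical; the intrinsic is the documented `__fmaf_rn`):

```cuda
// One MAC step per thread. With the default -fmad=true, the plain
// expression a[i] * b[i] + acc[i] usually compiles to a single FFMA;
// __fmaf_rn requests the fused multiply-add (round-to-nearest) explicitly,
// independent of compiler flags.
__global__ void mac(float* __restrict__ acc,
                    const float* __restrict__ a,
                    const float* __restrict__ b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acc[i] = __fmaf_rn(a[i], b[i], acc[i]);
}
```

If the dump shows separate FMUL/FADD pairs instead of FFMA, that is a sign the compiler could not (or was told not to) contract the operations.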

It can be instructive to observe how changes to HLL source code lead to changes in the SASS; this may lead to a preference for some source-code idioms over others. This is a pretty brittle approach, however.

As for programming at the PTX level, I would advise against it except as a measure of last resort. PTX is assembly code for a virtual ISA (and also a compiler intermediate format), and instructions at that level may well be emulated on some architectures. The reason it exists is that there is no binary compatibility between the ISAs of different GPU architectures.

Unless you have a lot of experience designing with FPGAs, I would advise against this. If you have the relevant experience, by all means give it a try. You may even get an interesting publication out of it if your use case has some novel aspects. GPUs are not a cure-all, they are just another tool in the toolbox. A widely applicable one, I would claim. Knowing which tool is the most appropriate for a given task is part of ensuring engineering success.