How are cuBLAS and cuDNN so fast that neither CUTLASS, nor the TensorFlow/PyTorch kernels, nor kernels written according to the developers’ guidelines manage to reach or reproduce their performance?
I know that both libraries are designed and implemented by hardware and software experts, and that every company has its own secrets and an interest in keeping its software the best on the market. I understand that they achieve the absolute best in maximizing coalescence at the global memory level, minimizing bank conflicts in shared memory, perfecting the distribution of warps and blocks, striking the right balance in register usage, and so on, but I have the feeling that there is something more at play than just the developers’ guidelines, CUDA, and PTX.
A hardware vendor typically has more architectural knowledge available inside the company than is made available to the public, and this can be exploited for optimizations. Regardless of platform, vendor-supplied libraries also often include hand-written assembly/machine code tuned for specific processor microarchitectures.
NVIDIA does not make machine-code-level tools available to the public, one reason presumably being that there are too many differences between GPU architectures. These are abstracted away in PTX, but the consequence is that PTX must be compiled into machine code, so despite what the name might suggest, PTXAS is an optimizing compiler, not an assembler.
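You can see this for yourself by comparing the PTX that nvcc emits with the SASS that ptxas ultimately produces. A minimal sketch (the file name and the architecture flag are arbitrary):

```
// saxpy.cu -- a trivial kernel used only to compare PTX and SASS output.
//
//   nvcc -arch=sm_70 -ptx   saxpy.cu -o saxpy.ptx    // virtual ISA (PTX)
//   nvcc -arch=sm_70 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump --dump-sass saxpy.cubin                // machine code (SASS)
//
// The PTX and the SASS will generally not match instruction for instruction,
// because ptxas performs register allocation, instruction scheduling, and
// other optimizations on the way to machine code.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```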
You can trawl the internet for hobbyist / research-level machine code assemblers for NVIDIA GPUs based on reverse engineering.
It’s a bit dated (it basically had the Maxwell architecture in view), but a ninja in the community named Scott Gray did create an assembler (maxas) and also documented the work he did to beat CUBLAS.
My opinion is that the work Scott did demonstrated that with that level of effort, the performance level achieved by CUBLAS could be met or exceeded by coding in that style.
Going back to the original question: Can you beat CUBLAS or CUDNN with CUDA as shipped? With high probability, the answer is “yes”, for particular use cases.
A problem with libraries is that they cannot be optimal for every use case. Even though CUBLAS contains dozens of different GEMM kernels under the hood, there is on average one forum post per year where someone reports that their compiled CUDA code beat CUBLAS on GEMM for a particular combination of matrix sizes, matrix element type, matrix aspect ratios, transpose modes, and GPU architecture.
The other issue with libraries is that no vendor has the resources to apply ninja-level optimizations to every function in the library. Optimization efforts are focused on those functions considered to be the most performance critical or the most commonly used. So someone who focuses on lesser-used library functions has a good chance of beating the library on performance, assuming they have enough skill to tackle such optimizations.
When exploring the SASS of cuBLAS SGEMM I saw instruction patterns that I had no idea how to reproduce with CUDA, PTX, and vectorization techniques. The suggestion to manipulate the *.cubin output via maxas and then load and use it sounds acceptable, but it is only supported for Maxwell. At least now I understand that one needs an assembler to reproduce the perfectly minimized numbers of LDG.128, LDS.128, STS.128, and STG requests, and the transaction rates of ≤ 2 at the global and shared memory levels, that NVIDIA has achieved.
Thank you for the clarifications.
Really impressive work. Thank you for the references.
It should be trivial to get 128-bit load and store instructions out of compiled CUDA code. Just use 128-bit data types like float4, double2, etc. Note: Since loads and stores must be naturally aligned on the GPU, care must be taken when casting a float pointer to a float4 pointer, etc.
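For example (a minimal sketch; `copy_vec4` is just an illustrative name, and the pointers are assumed to be 16-byte aligned, which allocations from cudaMalloc always are):

```
// Copy kernel that issues 128-bit (LDG.128 / STG.128) accesses by
// reinterpreting float pointers as float4 pointers. This only works if the
// underlying pointers are 16-byte aligned and n is a multiple of 4; a real
// kernel would also need a scalar tail loop for the remainder.
__global__ void copy_vec4(const float * __restrict__ in,
                          float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float4 *in4  = reinterpret_cast<const float4 *>(in);
    float4       *out4 = reinterpret_cast<float4 *>(out);
    if (i < n / 4)
        out4[i] = in4[i];   // one LDG.128 and one STG.128 per thread
}
```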
You are absolutely right, but shared memory is divided into 32 banks, with consecutive 32-bit words assigned to consecutive banks, and according to my observation (which may be wrong) the use of vectorization in some cases leads to more bank conflicts, because a float4 load generates 4 transactions per request.
Let’s assume that we have 1 block with 32 threads, with 16 threads per row. Each thread has to load 8 values from shared memory into registers. Profiling such a kernel (1 block, 32 threads) with Nsight shows that each warp uses 2 x LDS.128 at 4 transactions per request. cuBLAS succeeds in doing this with only 2 transactions.
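The access pattern is roughly the following (a simplified sketch; `smem_to_reg` and the array sizes are made up for illustration, and the global memory part only exists to fill shared memory):

```
// 1 block of 32 threads, arranged as 2 rows of 16 threads. Each thread copies
// 8 consecutive floats from shared memory into registers; the compiler may
// combine each group of 4 scalar loads into one LDS.128 (2 per thread).
__global__ void smem_to_reg(const float *g, float *out)
{
    // 16-byte alignment makes it possible for the compiler to vectorize.
    __shared__ __align__(16) float tile[2][16 * 8];

    int row = threadIdx.x / 16;   // 0 or 1
    int col = threadIdx.x % 16;   // 0..15

    // Fill shared memory from global memory (g must hold >= 256 floats).
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        tile[row][col * 8 + k] = g[threadIdx.x * 8 + k];
    __syncthreads();

    // Shared memory -> registers: 8 scalar loads per thread.
    float r[8];
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        r[k] = tile[row][col * 8 + k];

    // Consume the values so the loads are not optimized away.
    float s = 0.f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        s += r[k];
    out[threadIdx.x] = s;
}
```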
(I’m actually impressed that the compiler recognizes these and converts them to a vector load under the hood. I would ordinarily advise CUDA programmers to do an explicit vector load, but that doesn’t seem to be the issue here.)
CUBLAS can’t do any better than that. For each request, up to 128 bytes (or 256 bytes in the case of cc3.x) can be served up by shared memory per transaction. A warp-wide LDS.128 request requires 32 threads x 16 bytes = 512 bytes, so on architectures other than cc3.x, four transactions per request (i.e. per LDS.128 instruction) will be required. On cc3.x it may be possible to observe this at 2 transactions per request.
These statements assume a full active warp of 32 threads. (One block with 32 threads will have one warp of 32 threads.)