How are cuBLAS and cuDNN so fast that neither CUTLASS, nor the TensorFlow/PyTorch kernels, nor kernels written according to the developers’ guidelines manage to reach or reproduce their performance?
I know that both libraries are designed and implemented by hardware and software experts, and that every company has its own secrets and an interest in keeping its software the best on the market. I understand that they achieve the absolute best in maximizing coalescence at the global memory level, minimizing bank conflicts in shared memory, perfecting the distribution of warps and blocks, striking the right balance in register usage, and so on, but I have the feeling that there is something more at play than just the developers’ guidelines, CUDA, and PTX.
A hardware vendor typically has more architectural knowledge available inside the company than is made available to the public, and this can be exploited for optimizations. Regardless of platform, vendor-supplied libraries also often include hand-written assembly/machine code tuned for specific processor microarchitectures.
NVIDIA does not make machine-code-level tools available to the public, one reason presumably being that there are too many differences between GPU architectures. These are abstracted away in PTX, but the consequence is that PTX must be compiled into machine code, so despite what the name might suggest, PTXAS is an optimizing compiler, not an assembler.
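You can see this for yourself by comparing the PTX that nvcc emits with the SASS that ptxas ultimately produces. A minimal sketch (the file name and the architecture flag are arbitrary):

```
// saxpy.cu -- a trivial kernel used only to compare PTX and SASS output.
//
//   nvcc -arch=sm_70 -ptx   saxpy.cu -o saxpy.ptx    // virtual ISA (PTX)
//   nvcc -arch=sm_70 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump --dump-sass saxpy.cubin                // machine code (SASS)
//
// The PTX and the SASS will generally not match instruction for instruction,
// because ptxas performs register allocation, instruction scheduling, and
// other optimizations on the way to machine code.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```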
You can trawl the internet for hobbyist / research-level machine code assemblers for NVIDIA GPUs based on reverse engineering.
It’s a bit dated (it basically had the Maxwell architecture in view), but a ninja in the community named Scott Gray did create an assembler (maxas) and also documented the work he did to beat CUBLAS.
My opinion is that the work Scott did demonstrated that with that level of effort, the performance level achieved by CUBLAS could be met or exceeded by coding in that style.
Going back to the original question: Can you beat CUBLAS or CUDNN with CUDA as shipped? With high probability, the answer is “yes”, for particular use cases.
A problem with libraries is that they cannot be optimal for every use case. Even though CUBLAS contains dozens of different GEMM kernels under the hood, there is on average one forum post per year where someone reports that their compiled CUDA code beat CUBLAS on GEMM for a particular combination of matrix sizes, matrix element type, matrix aspect ratios, transpose modes, and GPU architecture.
The other issue with libraries is that no vendor has the resources to apply ninja-level optimizations to every function in the library. Optimization efforts are focused on those functions considered to be the most performance critical or the most commonly used. So someone who focuses on lesser-used library functions has a good chance of beating the library on performance, assuming they have enough skill to tackle such optimizations.
When exploring the SASS of cuBLAS SGEMM I saw instruction patterns that I had no idea how to reproduce with CUDA, PTX, and vectorization techniques. The suggestion to manipulate the *.cubin output via maxas and then load and use it sounds acceptable, but it is only supported for Maxwell. At least now I understand that one needs an assembler to reproduce the perfectly minimized numbers of LDG.128, LDS.128, STS.128, and STG requests, and the transaction rates of ≤ 2 at the global and shared memory levels, that NVIDIA has achieved.
Thank you for the clarifications.
Really impressive work. Thank you for the references.
It should be trivial to get 128-bit load and store instructions out of compiled CUDA code. Just use 128-bit data types like float4, double2, etc. Note: Since loads and stores must be naturally aligned on the GPU, care must be taken when casting a float pointer to a float4 pointer, etc.
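For example (a minimal sketch; `copy_vec4` is just an illustrative name, and the pointers are assumed to be 16-byte aligned, which allocations from cudaMalloc always are):

```
// Copy kernel that issues 128-bit (LDG.128 / STG.128) accesses by
// reinterpreting float pointers as float4 pointers. This only works if the
// underlying pointers are 16-byte aligned and n is a multiple of 4; a real
// kernel would also need a scalar tail loop for the remainder.
__global__ void copy_vec4(const float * __restrict__ in,
                          float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float4 *in4  = reinterpret_cast<const float4 *>(in);
    float4       *out4 = reinterpret_cast<float4 *>(out);
    if (i < n / 4)
        out4[i] = in4[i];   // one LDG.128 and one STG.128 per thread
}
```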
You are absolutely right, but shared memory is divided into 32 banks, with consecutive 32-bit words assigned to consecutive banks, and according to my observation (which may be wrong) the use of vectorization in some cases leads to more bank conflicts, because a float4 load generates 4 transactions per request.
Let’s assume that we have 1 block with 32 threads, with 16 threads per row. Each thread has to load 8 values from shared memory into registers. Profiling such a kernel (1 block, 32 threads) with Nsight shows that each warp uses 2 x LDS.128 at 4 transactions per request. cuBLAS succeeds in doing this with only 2 transactions.
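The access pattern is roughly the following (a simplified sketch; `smem_to_reg` and the array sizes are made up for illustration, and the global memory part only exists to fill shared memory):

```
// 1 block of 32 threads, arranged as 2 rows of 16 threads. Each thread copies
// 8 consecutive floats from shared memory into registers; the compiler may
// combine each group of 4 scalar loads into one LDS.128 (2 per thread).
__global__ void smem_to_reg(const float *g, float *out)
{
    // 16-byte alignment makes it possible for the compiler to vectorize.
    __shared__ __align__(16) float tile[2][16 * 8];

    int row = threadIdx.x / 16;   // 0 or 1
    int col = threadIdx.x % 16;   // 0..15

    // Fill shared memory from global memory (g must hold >= 256 floats).
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        tile[row][col * 8 + k] = g[threadIdx.x * 8 + k];
    __syncthreads();

    // Shared memory -> registers: 8 scalar loads per thread.
    float r[8];
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        r[k] = tile[row][col * 8 + k];

    // Consume the values so the loads are not optimized away.
    float s = 0.f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        s += r[k];
    out[threadIdx.x] = s;
}
```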
(I’m actually impressed that the compiler recognizes these and converts them to a vector load under the hood. I would ordinarily advise CUDA programmers to do an explicit vector load, but that doesn’t seem to be the issue here.)
CUBLAS can’t do any better than that. For each request, up to 128 bytes (or 256 bytes in the case of cc3.x) can be served up by shared memory per transaction. A warp-wide LDS.128 request requires 32 threads x 16 bytes = 512 bytes, so on architectures other than cc3.x, four transactions per request (i.e. per LDS.128 instruction) will be required. On cc3.x it may be possible to observe this at 2 transactions per request.
These statements assume a full active warp of 32 threads. (One block with 32 threads will have one warp of 32 threads.)