CUBLAS 2.2 improved

Hi everybody,

We have recently designed a method to improve the performance of some BLAS-3 routines in
CUBLAS 2.2. Our method applies to the routines with lower performance: symm, syrk, syr2k, trmm, and trsm.

The method is described in detail in the following report:
http://www.hpca.uji.es/gpgpu/CUDA_ZONE/FLAWN37.pdf

  • We have developed just one case of each routine, but the method also works for the other
    cases (each CUBLAS routine covers several cases: for instance, A transposed, B not transposed, etc.).
  • We have developed the codes for single precision only, but the same idea translates easily
    to double precision.

The best speedups show up on T10 cards. On our T10 cards, the new codes are between 1.5 and 4
times faster than CUBLAS 2.2 on large square matrices, and between 2 and 12 times faster
on large rectangular matrices. All our new codes attain about 300 GFLOPS for
large square matrices on a T10 card, not too far from gemm performance. See the plots below.

Find attached two figures illustrating the performance and speedups of the
implemented routines compared with the same routines in CUBLAS 2.2.
The results shown correspond to the best algorithmic variant and block
size for each BLAS-3 routine.

Plots for square matrices:

Plots for rectangular matrices:

The report describes in detail how to develop the codes for the specific cases we studied.
Applying the same method to the other cases is not very difficult, since we worked
at a high level: we obtained very high performance without writing any code in the CUDA language.
For every case, we tested the three usual variants (matrix-panel, panel-matrix, and
panel-panel) with several block sizes and selected the best combination.
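
To illustrate the general idea of building a BLAS-3 routine out of gemm calls, here is a minimal sketch in plain Python of a blocked SYRK for the UPLO = 'LOWER', TRANS = 'NO TRANSPOSE' case. It is not the code from the report: the function names (`gemm_nt`, `syrk_lower_blocked`), the default block size, and the loop ordering are assumptions made for this example, and the loop shown is not necessarily the variant selected in the report. In a GPU setting, `gemm_nt` would presumably be replaced by a call to the library gemm, and the small diagonal update by a call to the library syrk on a block.

```python
def gemm_nt(alpha, A, B, beta, C):
    """C := alpha * A * B^T + beta * C on lists of rows (stand-in for a fast gemm)."""
    for i in range(len(A)):
        for j in range(len(B)):
            acc = sum(a * b for a, b in zip(A[i], B[j]))
            C[i][j] = alpha * acc + beta * C[i][j]


def syrk_lower_blocked(alpha, A, beta, C, block=2):
    """C := alpha * A * A^T + beta * C, touching only the lower triangle of C."""
    n = len(A)
    for k in range(0, n, block):
        rows = range(k, min(k + block, n))
        panel = [A[i] for i in rows]
        # Off-diagonal part of this block row: a full rectangle, so one gemm does it.
        if k > 0:
            left = [A[j] for j in range(k)]
            tile = [C[i][:k] for i in rows]
            gemm_nt(alpha, panel, left, beta, tile)
            for t, i in enumerate(rows):
                C[i][:k] = tile[t]
        # Diagonal block: small triangular update (a small syrk in a real library).
        for i in rows:
            for j in range(k, i + 1):
                acc = sum(a * b for a, b in zip(A[i], A[j]))
                C[i][j] = alpha * acc + beta * C[i][j]
    return C


# Small usage example: a 3 x 2 matrix A and a 3 x 3 result C.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[0.0] * 3 for _ in range(3)]
syrk_lower_blocked(1.0, A, 0.0, C, block=2)
```

The point of the sketch is that most of the arithmetic ends up in the rectangular gemm call, while the triangular part is confined to small diagonal blocks; the block size controls that balance.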

We can share the codes for the cases we have already written (if your application is
non-commercial). We cannot promise anything for the other cases, as we do not have any funding.

If you are interested, please contact us by email at the address below.

fran, greg, and robert

% =======================================
Francisco Igual Peña
figual@icc.uji.es
Dept of Computer Science and Engineering
University Jaume I
12.071 - Castellon (Spain)
% =======================================

More specifically, the cases we have already developed are the following:

  • SSYMM: SIDE = ‘LEFT’; UPLO = ‘LOWER’.
  • SSYR2K: UPLO = ‘UPPER’; TRANS = ‘NO TRANSPOSE’.
  • SSYRK: UPLO = ‘LOWER’; TRANS = ‘NO TRANSPOSE’.
  • STRMM: SIDE = ‘LEFT’; UPLO = ‘LOWER’; TRANSA = ‘NO TRANSPOSE’; DIAG = ‘NO UNIT’.
  • STRSM: SIDE = ‘RIGHT’; UPLO = ‘LOWER’; TRANSA = ‘TRANSPOSE’; DIAG = ‘NO UNIT’.
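
For readers less familiar with the BLAS parameter names, the following plain-Python sketch spells out what the STRSM case above computes under the standard BLAS definition: it solves X * A^T = alpha * B for X, where A is lower triangular with a non-unit diagonal. This is only a reference for the semantics, not an optimized code; the function name `trsm_right_lower_trans` is made up for this example, and unlike the BLAS routine it returns X instead of overwriting B.

```python
def trsm_right_lower_trans(alpha, A, B):
    """Solve X * A^T = alpha * B for X (A lower triangular, non-unit diagonal).

    A is n x n and B is m x n, both given as lists of rows.
    Returns X as a new m x n list of rows; A and B are not modified.
    """
    n = len(A)
    X = []
    for b in B:
        x = [0.0] * n
        # A^T is upper triangular, so each row of X follows by forward substitution.
        for j in range(n):
            s = alpha * b[j] - sum(x[k] * A[j][k] for k in range(j))
            x[j] = s / A[j][j]
        X.append(x)
    return X


# Small usage example with a 2 x 2 lower triangular A.
A_tri = [[2.0, 0.0], [1.0, 4.0]]
B_rhs = [[2.0, 9.0], [6.0, -1.0]]
X_sol = trsm_right_lower_trans(1.0, A_tri, B_rhs)
```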
