Improving the performance of CUBLAS 2.2

Hi everybody,

Recently we have designed a method to improve the performance of some
BLAS-3 routines from CUBLAS 2.2. Our method applies to the routines
with the lowest performance: symm, syrk, syr2k, trmm, and trsm.

The method is described in detail in the following report:
http://www.hpca.uji.es/gpgpu/CUDA_ZONE/FLAWN37.pdf

  • We have developed just one case of every routine, but the method
    also works for the other cases (each CUBLAS routine includes
    several cases: for instance, A transposed and B not transposed).
  • We have developed the codes only for single precision, but the
    same idea can easily be translated to double precision.

The largest speedups appear on T10 cards.
On our T10 cards, the speedup of the new codes over CUBLAS 2.2 is
between 1.5 and 4 on large square matrices, and between 2 and 12 on
large rectangular matrices.
All our new codes attain about 300 GFLOPS on large square matrices
on a T10 card, not far from the performance of gemm.

The report describes in detail how to develop the codes for the
specific cases we studied.
Applying the same method to the other cases is not very difficult,
since we worked at a high level: we obtained very high performance
without touching the CUDA language at all, except in one case.
For every case, we tested the three usual blocked variants
(matrix-panel, panel-matrix, and panel-panel) with several block
sizes, and selected the best one.
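
To give a flavour of the high-level approach, below is a minimal
sketch of one such blocked code: a single-precision syrk case
(C := alpha*A*A' + beta*C, with C lower triangular and A not
transposed) written against the legacy CUBLAS 2.2 API. The
partitioning, the helper name blocked_ssyrk_ln, and the block-size
parameter nb are illustrative assumptions, not the exact code from
the report:

/* Sketch only: blocked SSYRK (C := alpha*A*A' + beta*C, C lower,
 * A not transposed) on top of the legacy CUBLAS 2.2 API.
 * dA and dC are device pointers in column-major order, and
 * cublasInit() is assumed to have been called already. */
#include <cublas.h>

static void blocked_ssyrk_ln(int n, int k, float alpha,
                             const float *dA, int lda,
                             float beta, float *dC, int ldc, int nb)
{
    for (int i = 0; i < n; i += nb) {
        int ib = (n - i < nb) ? (n - i) : nb;

        /* Small diagonal block C(i:i+ib, i:i+ib): the only part that
         * still goes through the slower cublasSsyrk routine. */
        cublasSsyrk('L', 'N', ib, k, alpha,
                    dA + i, lda, beta, dC + i + i * ldc, ldc);

        /* Tall off-diagonal block C(i+ib:n, i:i+ib): cast as a GEMM,
         * the routine that CUBLAS 2.2 already runs fast. */
        if (i + ib < n)
            cublasSgemm('N', 'T', n - i - ib, ib, k, alpha,
                        dA + i + ib, lda, dA + i, lda,
                        beta, dC + (i + ib) + i * ldc, ldc);
    }
}

The tall off-diagonal update carries almost all of the flops and runs
at gemm speed, which is why the blocked codes can get close to gemm
performance; the block size nb trades the amount of slow syrk work on
the diagonal against the shape of the gemm calls.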

We can share the codes for the cases we have already written
(provided your application is non-commercial). We cannot promise
anything for the other cases, as we do not have any funding.

If you are interested, contact us by email at the address below.

fran, greg, and robert

% =======================================
Francisco Igual Peña
figual@icc.uji.es
Dept of Computer Science and Engineering
University Jaume I
12.071 - Castellon (Spain)
% =======================================