CUBLAS SGEMM questions: some speed and memory details

mattb3 · April 23, 2008, 4:14am

I have been a lurker on these forums for several months, and since this is my first post I would like to give a big thank you to the NVIDIA developers and other forum members for making some very powerful hardware so easy to use. As a Matlab and algorithm guy, I have found the transition to C and CUDA fairly painless and enjoyable. We have had almost unbelievable success implementing signal processing routines using CUDA on Teslas. We’re in that rare upper-right-hand corner of faster AND (much) cheaper. This success is certainly due to the efficient and easy-to-use CUDA API, not to mention the developer community on these forums. With that out of the way, here are two SGEMM questions that I haven’t found answered on these forums yet.

1. Is it safe to use SGEMM with in-place memory? For example, an easy way to sum each row (or column) of a matrix is to multiply by a ones vector of the appropriate length. If A is m x n and x is n x 1 and consists of ones, then y = A*x is the sum across each row of A. This operation is very useful for computing means and other low-level utilities. When implementing a stand-alone function called rowSum or whatever, where the only memory pointers are the input A and the output y, it’s desirable NOT to have to pass in or allocate the ones vector for the function call. It turns out that if you initialize the output memory to ones and then call SGEMM with the arguments B and C pointing to the same memory location, the ones vector is overwritten in place with the correct output (assuming care is taken that the output memory allocation is large enough).
So in summary, calling SGEMM in place, where the output overwrites one of the inputs, seems to work for our case. My question is: is this safe and guaranteed to work? Is this usage common for BLAS routines, or am I just getting lucky? I have actually tried to look at the CUBLAS source, but there are at least a dozen SGEMM code paths, none of which I understand.
2. Does anyone know if CUBLAS SGEMM and/or CGEMM speed enhancements are planned for the 2.0 or later releases? I’m asking because probably 60% of our compute time is spent in CGEMM, and based on the fast SGEMM optimizations posted on this forum, there is clearly some room to improve, at least in certain situations. I guess I’m wondering whether further optimization of these routines is on anyone’s to-do list.

mfatica · April 23, 2008, 5:01am

It is unsafe. C should not overwrite A or B.
One of the code paths in the CUBLAS 2.0 beta (which you can download right now) calls the fast code by Volkov when N is a multiple of 64.
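
To make the hazard concrete, here is a minimal CPU sketch in plain C. It is not CUBLAS code and says nothing about how CUBLAS orders its loads and stores. The names ref_sgemv, row_sum_safe and row_sum_aliased are hypothetical, chosen only for this illustration. It assumes column-major storage with leading dimension m, as BLAS does, and uses a straightforward sequential loop in which each output element is written before the next one is computed.

```c
#include <stdlib.h>

/* Reference y = alpha*A*x + beta*y for a column-major m x n matrix A
   with leading dimension lda. Plain sequential loop, for illustration only. */
static void ref_sgemv(int m, int n, float alpha, const float *A, int lda,
                      const float *x, float beta, float *y)
{
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[i + j * lda] * x[j];  /* if x aliases y, x[j] may already be overwritten */
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Safe variant: the ones vector lives in its own buffer.
   Returns 0 on success, -1 if the allocation fails. */
int row_sum_safe(int m, int n, const float *A, float *y)
{
    float *ones = malloc((size_t)n * sizeof *ones);
    if (ones == NULL)
        return -1;
    for (int j = 0; j < n; ++j)
        ones[j] = 1.0f;
    for (int i = 0; i < m; ++i)
        y[i] = 0.0f;                       /* avoid reading uninitialized output */
    ref_sgemv(m, n, 1.0f, A, m, ones, 0.0f, y);
    free(ones);
    return 0;
}

/* Aliased variant from the question: buf is both the ones vector and the
   output. buf must hold at least max(m, n) floats. */
void row_sum_aliased(int m, int n, const float *A, float *buf)
{
    for (int j = 0; j < n; ++j)
        buf[j] = 1.0f;
    for (int j = n; j < m; ++j)
        buf[j] = 0.0f;
    ref_sgemv(m, n, 1.0f, A, m, buf, 0.0f, buf);
}
```

For the 3 x 3 matrix with rows (1 2 3), (4 5 6), (7 8 9), stored column-major as {1,4,7, 2,5,8, 3,6,9}, row_sum_safe gives 6, 15, 24, while row_sum_aliased gives 6, 35, 331 with this loop, because later rows read entries that earlier rows have already overwritten. A GPU kernel that happens to read all of B before writing C may hide the problem, as in the original post, but nothing guarantees that ordering across code paths or releases.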
