cublas sgemm questions: some speed and memory details

I have been a lurker on these forums for several months, and since this is my first post I would like to give a big thank you to the NVIDIA developers and other forum members for enabling the use of some very powerful hardware so easily. As a Matlab and algorithm guy, I have found the transition to C and CUDA fairly painless and enjoyable. We have had almost unbelievable success implementing signal processing routines using CUDA on Teslas. We're in that rare upper right-hand corner of faster AND (much) cheaper. This success is certainly due to the efficient and easy-to-use CUDA API, not to mention the developer community on these forums. That out of the way, here are two SGEMM questions that I haven't found answered on these forums yet.

  1. Is it safe to use SGEMM with in-place memory? For example, an easy way to sum each column (or row) of a matrix is to multiply by a ones vector of the appropriate length. If A is m x n and x is n x 1 consisting of ones, then y = A*x is the sum across each row of A. This operation is very useful for easily computing means and other low-level utilities. When implementing a stand-alone function called rowSum or whatever, where the only memory pointers are the input A and the output y, it's desirable NOT to have to pass in or allocate the ones vector for this function call. It turns out that simply by initializing the output memory to ones and then calling SGEMM with the B and C arguments pointing to the same memory location, the ones vector is overwritten in-place with the correct output (assuming care is taken that the output memory allocation is sufficient).
    So in summary, calling SGEMM in-place, where the output overwrites one of the inputs, seems to work for our case. My question is: is this safe and guaranteed to work? Is this usage common for BLAS routines, or am I just getting lucky? I have actually tried to look at the CUBLAS source, but there are at least a dozen SGEMM code paths, none of which I understand.

  2. Does anyone know if CUBLAS SGEMM and/or CGEMM speed enhancements are planned for the 2.0 or later releases? I'm asking because roughly 60% of our compute time is spent in CGEMM, and based on the fast SGEMM optimizations posted on this forum, there is clearly some room to improve, at least in certain situations. I guess I'm wondering if further optimization of these routines is on anyone's to-do list.

  1. It is unsafe; C should not overwrite A or B. Nothing guarantees that an aliased input is fully read before the corresponding part of C is written.

  2. One of the code paths in CUBLAS 2.0 beta (which you can download right now) calls the fast code by Volkov when N is a multiple of 64.