I have been a lurker on these forums for several months, and since this is my first post I would like to give a big thank you to the NVIDIA developers and other forum members for making some very powerful hardware so easy to use. As a Matlab and algorithm guy, I have found the transition to C and CUDA fairly painless and enjoyable. We have had almost unbelievable success implementing signal processing routines using CUDA on Teslas. We're in that rare upper-right-hand corner of faster AND (much) cheaper. This success is certainly due to the efficient and easy-to-use CUDA API, not to mention the developer community on these forums. That out of the way, here are two SGEMM questions that I haven't found on these forums yet.

Is it safe to use SGEMM with in-place memory? For example, an easy way to sum each column (or row) of a matrix is to multiply by a ones vector of the appropriate length. If A is m×n and x is n×1 consisting of ones, then y = A*x is the sum across each row of A. This operation is very useful for easily computing means and other low-level utilities. When implementing a standalone function called rowSum or whatever, where the only memory pointers are the input A and the output y, it's desirable NOT to have to pass in or allocate the ones vector for this function call. It turns out that simply by initializing the output memory to ones and then calling SGEMM with the B and C arguments pointing to the same memory location, the ones vector is overwritten in place with the correct output (assuming care is taken that the output memory allocation is sufficient).
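To make the trick concrete, here is a minimal CPU sketch (plain C, not cuBLAS; the function name row_sum_via_ones is my own) of what the y = A*x product computes when x is all ones:

```c
#include <assert.h>

/* Hypothetical CPU illustration of the row-sum-by-ones trick:
 * y = A * x where x is an n-vector of ones, so y[i] is the sum
 * of row i of A. A is m x n, stored row-major for readability
 * (cuBLAS would expect column-major). */
static void row_sum_via_ones(const float *A, float *y, int m, int n)
{
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[i * n + j] * 1.0f;  /* x[j] == 1 for every j */
        y[i] = acc;
    }
}
```

Dividing each y[i] by n then gives the row means mentioned above.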
So in summary, calling SGEMM in place, where the output overwrites one of the inputs, seems to work for our case. My question is: is this safe and guaranteed to work? Is this usage common for BLAS routines, or am I just getting lucky? I have actually tried to look at the cublas source, but there are at least a dozen SGEMM code paths, none of which I understand.
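For what it's worth, as far as I know the reference BLAS interface does not promise anything when arguments alias, so whether this works would depend on whether a given SGEMM path finishes reading B before it writes C. A naive CPU loop (my own sketch, not how cuBLAS is implemented) shows how aliasing can go wrong in general:

```c
#include <assert.h>

/* Naive y = A*x (A is n x n, row-major) to illustrate the aliasing
 * hazard. If x and y are the same buffer, y[0] clobbers x[0] before
 * the later rows have read it, producing a wrong result. This is a
 * hypothetical CPU sketch, not the cuBLAS code path. */
static void naive_gemv(const float *A, const float *x, float *y, int n)
{
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];  /* reads all of x every row */
        y[i] = acc;                      /* overwrites x[i] if aliased */
    }
}
```

With A all ones and x = (1, 1), the out-of-place call returns (2, 2), but the aliased call returns (2, 3): row 1 picks up the already-overwritten x[0]. A kernel that stages B into shared memory first would happen to avoid this, which may be why the in-place call works in practice, but that seems like an implementation detail rather than a guarantee.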
Does anyone know if cublas SGEMM and/or CGEMM speed enhancements are planned for the 2.0 or later releases? I'm asking because probably 60% of our compute time is spent in CGEMM, and based on the fast SGEMM optimizations posted on this forum, there is clearly some room to improve, at least in certain situations. I guess I'm wondering whether further optimization of these routines is on anyone's to-do list.