# cusparseScsrmv performance

Hi,
Looking at cuSPARSE performance, I have found a strange issue. Invoking the cusparseScsrmv function:

```c
cusparseStatus_t
cusparseScsrmv(
    cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, float alpha,
    const cusparseMatDescr_t *descrA,
    const float *csrValA,
    const int *csrRowPtrA, const int *csrColIndA,
    const float *x, float beta,
    float *y)
```

When the “float beta” parameter is set to “1.”, the multiplication runs ~10% faster than with beta set to “0.”.

Assuming:
y = alpha * op (A) * x + beta * y

setting beta = “0.” should give the GPU less work.

Why does cusparseScsrmv act this way?

Not exactly; it depends on the implementation.

If the implementation does

y := beta * y

y := alpha * op (A) * x + y

then it has to zero out y when beta is zero,

but can do nothing in the first step when beta is 1.0.
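A sketch of that two-stage scheme, assuming a simple CSR layout (hypothetical names, not the actual cuSPARSE kernel):

```c
/* Stage 1: y := beta * y. With beta == 0 this is a full memset-style write
   over y; with beta == 1 it is a no-op, so that path skips a memory pass. */
static void stage1_scale(int m, float beta, float *y)
{
    if (beta == 1.0f)
        return;                      /* nothing to do */
    if (beta == 0.0f) {
        for (int i = 0; i < m; ++i)  /* zero out y: an extra pass over y */
            y[i] = 0.0f;
    } else {
        for (int i = 0; i < m; ++i)
            y[i] *= beta;
    }
}

/* Stage 2: y += alpha * A * x (CSR). Always reads and writes y,
   regardless of beta. */
static void stage2_accumulate(int m, float alpha, const float *val,
                              const int *rowPtr, const int *colInd,
                              const float *x, float *y)
{
    for (int row = 0; row < m; ++row) {
        float dot = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            dot += val[j] * x[colInd[j]];
        y[row] += alpha * dot;
    }
}
```

Under this structure, beta == 0 costs an extra write pass over y in stage 1, while beta == 1 skips stage 1 entirely, which would make beta == 1 faster.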

Not in this case.

According to the function documentation (page 36)

“beta - scalar multiplier applied to y. If beta is zero, y does not have to be a valid input”.

This implies that the implementation must test whether beta == 0, and only if it is non-zero is it allowed to do the beta * y multiplication (otherwise the implementation would risk a segfault or other disasters when reading y).

Also, as far as I know, the reason for implementing such general “multiple argument” functions in libraries is to minimize memory transfers by doing the whole job in one step.

So, if beta == 1, one could (should) skip the beta * y multiplication, but one still has to read y from memory to complete step 2.

If beta == 0, you can forget about y altogether.

Thus beta == 0 should run faster, especially for this bandwidth-limited kernel.

Note that y is an input as well as an output (i.e. it is a read/write argument), so the pointer had better be valid; otherwise there is a bigger issue. What the specification states is that if beta is zero, the input data at y can be anything without affecting the result. For example, the initial data stored in y could consist of NaNs, which would normally propagate to the result even when multiplied by zero. The specification guarantees that these NaNs are ignored if beta is zero.
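A small illustration of why that guarantee forces a special case, using a hypothetical per-element helper: in IEEE arithmetic, 0.0f * NAN is NaN, so a naive beta * y would propagate garbage from an uninitialized y:

```c
#include <math.h>

/* Hypothetical per-element helper (axpby_elem is an illustrative name)
   showing the zero-beta guard the documentation implies. */
static float axpby_elem(float alpha, float dot, float beta, float yin)
{
    if (beta == 0.0f)
        return alpha * dot;          /* yin may be NaN or garbage; never read */
    return alpha * dot + beta * yin; /* naive path would propagate NaN here */
}
```

With yin = NAN and beta = 0, the naive formula alpha * dot + beta * yin yields NaN, while the guarded helper returns alpha * dot as the specification requires.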

In other words, what LSChien said about the two-stage computation applies: beta of zero requires memset’ting y to zero in the first step, while the first step is a no-op if beta is one, which explains the seemingly paradoxical performance characteristics observed. There are various advantages to using this two-stage approach for operations on sparse data.