cusparse csrmm: single precision (SP) slower than double precision (DP)?

I have a few cuSPARSE csrmm operations in an algorithm (C = op(A) * B, with A sparse and B, C dense, and op(A) set to CUSPARSE_OPERATION_NON_TRANSPOSE). The matrices differ in size and sparsity structure from one instance to the next, yet for every instance the single-precision (SP) execution runs slower than the equivalent double-precision (DP) one.
Average timing, SP: 0.285 s
Average timing, DP: 0.261 s

Does anyone have any idea what on earth is going on here?
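
For anyone unsure what the operation computes, here is a minimal CPU sketch of it for the non-transposed case. It is only an illustration of the arithmetic, not how cuSPARSE implements it. The function name csrmm_ref is hypothetical, and I am assuming zero-based CSR indices and column-major dense matrices with leading dimensions ldb and ldc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference: C = alpha * A * B + beta * C.
 * A is m x k in CSR form (val, row_ptr, col_ind), zero-based indices.
 * B is k x n and C is m x n, both dense and stored column-major.
 * k is implied by the column indices and ldb. */
void csrmm_ref(int m, int n, float alpha,
               const float *val, const int *row_ptr, const int *col_ind,
               const float *B, int ldb,
               float beta, float *C, int ldc)
{
    assert(ldc >= m);  /* each column of C must hold at least m rows */

    for (int j = 0; j < n; ++j) {          /* loop over columns of B and C */
        for (int i = 0; i < m; ++i) {      /* loop over rows of A */
            float acc = 0.0f;
            /* dot product of sparse row i of A with column j of B */
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
                acc += val[p] * B[(size_t)j * ldb + col_ind[p]];
            }
            C[(size_t)j * ldc + i] = alpha * acc + beta * C[(size_t)j * ldc + i];
        }
    }
}
```

The DP variant would be identical with float replaced by double, so the arithmetic itself gives no obvious reason for SP to be the slower one.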