Worse performance with CUDA 4.0 in sparse matrix computations with COO format

I’m running some tests with the SC2009 SpMV code (available in the CUSP downloads section), and for the COO format (and consequently the HYB format too) I’m seeing large performance differences between CUDA 3.0 (the version I had installed previously) and CUDA 4.0, with 4.0 being slower. It happens for all the matrices: in some of them the difference is more than a factor of two, in others it is only 10-20%, but it is always clearly noticeable.

Any idea what might be causing this?

Regards.