cublasSsyr2k is too slow (37 Gflop/s)

So much has been said about the performance of SGEMM, but what about the other BLAS3 routines? I get only 37 Gflop/s from SSYR2K on my GeForce 8800 GTX (that is 3.7 s for n=k=4096). This routine does not differ much from SGEMM (in fact, it can be implemented using SGEMM), yet it runs 3.3 times slower. Why?
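
To make the "can be implemented using SGEMM" remark concrete, here is a minimal pure-Python sketch of the SYR2K update C := alpha*A*B^T + alpha*B*A^T + beta*C written as two GEMM-style calls, followed by the flop arithmetic behind the 37 Gflop/s figure. It is an illustration only, not how CUBLAS does it: the helper names (gemm, syr2k_via_gemm, syr2k_reference, gflops) are made up for this example, and the flop count assumes the usual 2*n*n*k convention for SYR2K.

```python
# Pure-Python sketch (no CUBLAS involved): SYR2K expressed as two GEMM calls.
# Matrices are row-major lists of lists; every helper name is illustrative.

def gemm(alpha, a, b, beta, c):
    """Return alpha * A * B^T + beta * C for A (n x k), B (m x k), C (n x m)."""
    n, k, m = len(a), len(a[0]), len(b)
    return [
        [alpha * sum(a[i][p] * b[j][p] for p in range(k)) + beta * c[i][j]
         for j in range(m)]
        for i in range(n)
    ]

def syr2k_via_gemm(alpha, a, b, beta, c):
    """C := alpha*A*B^T + alpha*B*A^T + beta*C using two GEMM calls."""
    tmp = gemm(alpha, a, b, beta, c)    # alpha*A*B^T + beta*C
    return gemm(alpha, b, a, 1.0, tmp)  # adds alpha*B*A^T on top

def syr2k_reference(alpha, a, b, beta, c):
    """Direct element-by-element evaluation of the same update."""
    n, k = len(a), len(a[0])
    return [
        [alpha * sum(a[i][p] * b[j][p] + b[i][p] * a[j][p] for p in range(k))
         + beta * c[i][j]
         for j in range(n)]
        for i in range(n)
    ]

def gflops(flops, seconds):
    """Convert an operation count and a run time into Gflop/s."""
    return flops / seconds / 1e9

# Small check (n=3, k=2) with a symmetric C.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
C = [[1.0, 2.0, 3.0], [2.0, 4.0, 5.0], [3.0, 5.0, 6.0]]
result = syr2k_via_gemm(2.0, A, B, 0.5, C)
expected = syr2k_reference(2.0, A, B, 0.5, C)

# Flop arithmetic for the timing quoted in the post.
n = k = 4096
syr2k_flops = 2 * n * n * k           # assumed conventional SYR2K count
two_gemm_flops = 2 * (2 * n * n * k)  # two full n x n x k GEMMs
measured_rate = gflops(syr2k_flops, 3.7)
```

The two-GEMM version ignores the symmetry of the result, so it does roughly twice the arithmetic. Even so, at the SGEMM rate implied above (3.3 x 37, about 122 Gflop/s) it would take about 2.25 s for n=k=4096, which is still less than the 3.7 s I measured.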

For comparison, both SGEMM and SSYR2K in Intel MKL 9.1 run at ~12 Gflop/s on a 2.66 GHz Core2 Duo.

Please correct me if I got it wrong.