Why does my first cudaMalloc() call take so long?

It takes about 2 s on a GTX295, but very little time on a GT8800.

I know the runtime is initialized when the first cudaMalloc() is called (no other runtime functions are called before it in my program).
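
Here is a minimal sketch of the measurement, not my exact program. It times two consecutive cudaMalloc() calls, so the one-time initialization cost should show up only in the first number. The 1 MB buffer size and the helper names now_ms and time_malloc_ms are placeholders chosen for illustration.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock time in milliseconds (Linux, via gettimeofday).
static double now_ms() {
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// Time one cudaMalloc of `bytes` bytes; the buffer is freed after timing.
// Returns -1.0 if the allocation fails.
static double time_malloc_ms(size_t bytes) {
    void *ptr = NULL;
    double t0 = now_ms();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    double t1 = now_ms();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);
    return t1 - t0;
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MB, arbitrary size for the test

    // The first call includes the implicit runtime initialization.
    printf("first  cudaMalloc: %.3f ms\n", time_malloc_ms(bytes));
    // The second call should reflect the allocation cost alone.
    printf("second cudaMalloc: %.3f ms\n", time_malloc_ms(bytes));
    return 0;
}
```

If the second number is small on both cards, the difference between them is in the initialization and not in the allocation itself.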

Thanks!

Addition:
I am using openSUSE 11.1 (64-bit) with CUDA 2.3.