Recently I ran into a strange problem: every time I launch my program, the first cudaMalloc takes about 5000 ms, and after that everything seems fine. I also ran some SDK samples and saw the same 5000 ms latency before the program actually starts doing work. Is something wrong with the driver? My GPU is an S870.

That is initialization overhead for the context setup and such, which happens on the first CUDA call (I believe it is needed for each GPU).
I believe NVIDIA is working on getting it down.

Specifically, it was mentioned that startup times of 1+ seconds are an issue when running on a Linux console without X Windows. With X Windows running, startup times are ~100 ms. It was also mentioned that CUDA 2.2 might improve this. If I have the time, I might try timing the beta later today.
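
For anyone who wants to measure this on their own machine, here is a minimal sketch (not an official benchmark) that separates the one-time context setup from the cost of the allocations themselves. It relies on the common idiom of calling cudaFree(0) to force context creation up front. The time_ms helper and the 1 MB allocation size are my own choices for illustration, and it assumes a toolkit recent enough to accept C++11 lambdas in host code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time in milliseconds for one call to f().
template <typename F>
static double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaError_t err = cudaSuccess;
    void *a = nullptr;
    void *b = nullptr;
    const size_t bytes = 1 << 20;  // 1 MB, arbitrary size for illustration

    // cudaFree(0) frees nothing, but as the first runtime call it
    // triggers context creation, so the startup cost is paid here.
    double init_ms = time_ms([&] { err = cudaFree(0); });
    if (err != cudaSuccess) {
        fprintf(stderr, "init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already created, these should reflect only
    // the cost of the allocations themselves.
    double first_ms = time_ms([&] { err = cudaMalloc(&a, bytes); });
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double second_ms = time_ms([&] { err = cudaMalloc(&b, bytes); });
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        cudaFree(a);
        return 1;
    }

    printf("context init (cudaFree(0)): %.1f ms\n", init_ms);
    printf("first cudaMalloc:           %.3f ms\n", first_ms);
    printf("second cudaMalloc:          %.3f ms\n", second_ms);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the first line accounts for nearly all of the delay, the time is going into context setup rather than into cudaMalloc itself. Moving the cudaFree(0) call to program startup does not remove the cost, but it keeps it out of any timed section of the code.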