Does CUBLAS 4.0.13 support multiple streams and multiple devices from a single host thread?

I am using CUDA 4.0.13 on Tesla C2050.

cudaSetDevice(0);
cudaStreamCreate(&s0);
cublasCreate(&b0);
… // copy memory to device

cudaSetDevice(1);
cudaStreamCreate(&s1);
cublasCreate(&b1);
… // copy memory to device

cudaSetDevice(0);
cublasSetStream(b0, s0);   // the handle is the first argument
cublasSgemm(b0, … );

cudaSetDevice(1);
cublasSetStream(b1, s1);
cublasSgemm(b1, … );

… // do other work
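
For reference, here is a minimal self-contained sketch of the pattern above. It is not my real program. It assumes the cuBLAS v2 API (cublas_v2.h), and the helper name run_multi_gpu_sgemm, the matrix size, the all-ones input data and the pinned host buffers are all made up for illustration. The idea is to queue the copies and the SGEMM on every device before waiting on any of them, so that the devices can work at the same time.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define MAX_GPUS 2  // assumption: two devices, as in the post

// Return -1 from the enclosing function on any CUDA or cuBLAS error.
// (A sketch: resources are not released on the error path.)
#define CUDA_OK(call)   do { if ((call) != cudaSuccess) return -1; } while (0)
#define CUBLAS_OK(call) do { if ((call) != CUBLAS_STATUS_SUCCESS) return -1; } while (0)

// Hypothetical helper: computes C = A * B on each device, where A and B are
// n x n matrices filled with ones. Returns 0 if C[0] == n on every device.
int run_multi_gpu_sgemm(int n) {
    assert(n > 0);
    const float alpha = 1.0f, beta = 0.0f;
    const size_t elems = (size_t)n * n;
    const size_t bytes = elems * sizeof(float);
    int gpus = 0;
    CUDA_OK(cudaGetDeviceCount(&gpus));
    if (gpus > MAX_GPUS) gpus = MAX_GPUS;

    cudaStream_t   s[MAX_GPUS];
    cublasHandle_t b[MAX_GPUS];
    float *dA[MAX_GPUS], *dB[MAX_GPUS], *dC[MAX_GPUS];
    float *hA = NULL, *hC = NULL;

    // Pinned, portable host buffers: copies from pageable memory may block the host.
    CUDA_OK(cudaHostAlloc((void **)&hA, bytes, cudaHostAllocPortable));
    CUDA_OK(cudaHostAlloc((void **)&hC, MAX_GPUS * sizeof(float), cudaHostAllocPortable));
    for (size_t i = 0; i < elems; ++i) hA[i] = 1.0f;

    // Phase 1: per-device setup and asynchronous host-to-device copies.
    for (int d = 0; d < gpus; ++d) {
        CUDA_OK(cudaSetDevice(d));
        CUDA_OK(cudaStreamCreate(&s[d]));
        CUBLAS_OK(cublasCreate(&b[d]));          // handle belongs to device d
        CUBLAS_OK(cublasSetStream(b[d], s[d]));  // handle first, then stream
        CUDA_OK(cudaMalloc((void **)&dA[d], bytes));
        CUDA_OK(cudaMalloc((void **)&dB[d], bytes));
        CUDA_OK(cudaMalloc((void **)&dC[d], bytes));
        CUDA_OK(cudaMemcpyAsync(dA[d], hA, bytes, cudaMemcpyHostToDevice, s[d]));
        CUDA_OK(cudaMemcpyAsync(dB[d], hA, bytes, cudaMemcpyHostToDevice, s[d]));
    }

    // Phase 2: enqueue SGEMM on every device before waiting on any of them.
    for (int d = 0; d < gpus; ++d) {
        CUDA_OK(cudaSetDevice(d));
        CUBLAS_OK(cublasSgemm(b[d], CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                              &alpha, dA[d], n, dB[d], n, &beta, dC[d], n));
        // Fetch one element of the result for a sanity check.
        CUDA_OK(cudaMemcpyAsync(&hC[d], dC[d], sizeof(float),
                                cudaMemcpyDeviceToHost, s[d]));
    }

    // Phase 3: only now wait for each device, check the result, and clean up.
    int ok = 1;
    for (int d = 0; d < gpus; ++d) {
        CUDA_OK(cudaSetDevice(d));
        CUDA_OK(cudaStreamSynchronize(s[d]));
        if (hC[d] != (float)n) ok = 0;  // a sum of n ones is exactly n
        cudaFree(dA[d]); cudaFree(dB[d]); cudaFree(dC[d]);
        cublasDestroy(b[d]);
        cudaStreamDestroy(s[d]);
    }
    cudaFreeHost(hA);
    cudaFreeHost(hC);
    return ok ? 0 : -1;
}

int main(void) {
    int rc = run_multi_gpu_sgemm(512);
    printf("multi-GPU SGEMM %s\n", rc == 0 ? "succeeded" : "failed");
    return rc == 0 ? 0 : 1;
}
```

If the host waits on device 0 (for example with a blocking cudaMemcpy or cudaDeviceSynchronize) before issuing the work for device 1, the two devices run one after the other, which would look like the doubled time described below.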

For example, running 1 context on 1 GPU takes 10 sec, but running 2 contexts on 2 GPUs takes 20 sec. Hence, either the context switch does not work or the calculations run in synchronous mode.

Why doesn't it work? Or does CUDA 4.0.13 not support multi-GPU CUBLAS in asynchronous mode?

P.S. Sorry for my English.