CUBLAS on CUDA Streams

I am running CUBLAS v2.0 on different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams × y] · B[x × y] = C[x/num_of_streams × y]), but somehow it takes more time when I use CUDA streams. Here is a code snippet:

//plan is a struct containing the matrix dimensions and stream numbers

//parallel in nstreams - should be! MAX 16 streams could run concurrently

//Copy A - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
        cudgemm_copyA_in_streams(&plan[i]);

//Copy B - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
        cudgemm_copyB_in_streams(&plan[i]);

//Create handles - serial
for (i = 0; i < nstreams; i++)
        handle[i] = create_handle();

//Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for (i = 0; i < nstreams; i++)
        cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0f, 1.0f);

//Destroy handles - serial
for (i = 0; i < nstreams; i++)
        destroy_handle(handle[i]);

//Copy C - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
        cudgemm_copyC_in_streams(&plan[i]);
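For context, here is a hedged sketch of what the copy and kernel helpers might look like. The struct fields, helper bodies, and the assumption that B is square are all guesses, not the actual code; only cudaMemcpyAsync, cublasSetStream, and cublasDgemm are real CUDA/CUBLAS calls. One detail worth checking: cudaMemcpyAsync only overlaps with other work when the host buffer is pinned (allocated with cudaMallocHost); with pageable memory it silently degrades to a synchronous copy, which alone can erase any benefit from streams.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Sketch only: the field names in plan_t are assumptions about the poster's struct.
typedef struct {
    double *A_host, *A_dev;   // A_host must come from cudaMallocHost for async overlap
    double *B_host, *B_dev;
    double *C_host, *C_dev;
    int rows, cols;           // dimensions of this stream's slice of A and C
    cudaStream_t stream;
} plan_t;

void cudgemm_copyA_in_streams(plan_t *p)
{
    // Returns immediately; the copy is queued on p->stream and overlaps
    // with work in other streams only if A_host is pinned memory.
    cudaMemcpyAsync(p->A_dev, p->A_host,
                    (size_t)p->rows * p->cols * sizeof(double),
                    cudaMemcpyHostToDevice, p->stream);
}

void cudgemm_kernel_in_streams(plan_t *p, cublasHandle_t handle,
                               double alpha, double beta)
{
    // Bind the handle to this plan's stream so the DGEMM is queued there,
    // then launch it. Assumes B is square (cols x cols) for illustration.
    cublasSetStream(handle, p->stream);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                p->rows, p->cols, p->cols,
                &alpha, p->A_dev, p->rows,
                        p->B_dev, p->cols,
                &beta,  p->C_dev, p->rows);
}
```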

EDIT:

I am timing the part of the code shown below.

for (i = 0; i < nstreams; i++)
{
        //contains cudaMemcpyAsync
        cudgemm_copyA_in_streams(&plan[i]);
        cudgemm_copyB_in_streams(&plan[i]);

        //creating and destroying the handles happens outside this loop
        cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0f, 1.0f);

        //contains cudaMemcpyAsync
        cudgemm_copyC_in_streams(&plan[i]);
}
cudaDeviceSynchronize();

So I am bouncing back and forth between streams and assigning work, expecting to get a better execution time. Where am I going wrong? Cross-posted to Stack Overflow: http://stackoverflow.com/questions/7303599/issues-with-cuda-streams

As some Stack Overflow users suggested, I made some small modifications. When I time the above loop, I get a time of 0.000284 s, versus 1.703289 s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation, and the remaining memcpy). Since I am not using any synchronization constructs around the timer, I may be printing the time before the computation actually finishes (I find it difficult to believe that there is a nearly 100% improvement).
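That suspicion is consistent with how asynchronous launches behave: the loop only queues work, so a CPU timer stopped right after it measures launch overhead, not execution. Below is a minimal sketch of timing the whole streamed pipeline with CUDA events; the helper names are the ones from the post, and the events are recorded into the legacy default stream (stream 0), which synchronizes with the other (blocking) streams, so the stop event completes only after all queued work has finished.

```c
// Sketch: measure the full pipeline, including work queued asynchronously.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // record in the legacy default stream
for (i = 0; i < nstreams; i++) {
    cudgemm_copyA_in_streams(&plan[i]);
    cudgemm_copyB_in_streams(&plan[i]);
    cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0f, 1.0f);
    cudgemm_copyC_in_streams(&plan[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // block until everything queued has completed
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Equivalently, a host-side wall-clock timer works as long as cudaDeviceSynchronize() is called before the timer is stopped; stopping it before the synchronize is what produces implausibly small numbers like 0.000284 s.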