I am running CUBLAS v2.0 in different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams, y] * B[x, y] = C[x/num_of_streams, y]), but somehow it takes more time when I use CUDA streams. Here is a code snippet:
//plan is a struct containing the matrix dimensions and stream numbers
//parallel across nstreams - should be! At most 16 streams can run concurrently
//Copy A - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams(&plan[i]);
//Copy B - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams(&plan[i]);
//Create handles - serial
for (i = 0; i < nstreams; i++)
    handle[i] = create_handle();
//Run kernels - each call does a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for (i = 0; i < nstreams; i++)
    cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0f, 1.0f);
//Destroy handles - serial
for (i = 0; i < nstreams; i++)
    destroy_handle(handle[i]);
//Copy C - cudaMemcpyAsync
for (i = 0; i < nstreams; i++)
    cudgemm_copyC_in_streams(&plan[i]);
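One thing worth checking in the copy routines above: cudaMemcpyAsync only overlaps with other work when the host buffers are pinned (page-locked); with ordinary pageable memory it silently degrades to a synchronous copy and the streams serialize. A minimal sketch of what the host-side allocation would need to look like (buffer names and sizes here are hypothetical, not taken from my code):

```cuda
#include <cuda_runtime.h>

//Sketch: pinned host allocation so cudaMemcpyAsync can actually run
//asynchronously in its stream. h_A/d_A/bytes are placeholder names.
void copy_chunk_async(double *d_A, size_t bytes, cudaStream_t stream)
{
    double *h_A;
    //cudaMallocHost gives page-locked memory; malloc'd memory would
    //force the "async" copy to behave synchronously
    cudaMallocHost((void **)&h_A, bytes);
    // ... fill h_A with this stream's slice of A ...
    cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream);
}
```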
EDIT:
I am timing this part of the code, as explained below.
for (i = 0; i < nstreams; i++)
{
    //contains cudaMemcpyAsync
    cudgemm_copyA_in_streams(&plan[i]);
    cudgemm_copyB_in_streams(&plan[i]);
    //handle creation and destruction happen outside this loop
    cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0f, 1.0f);
    //contains cudaMemcpyAsync
    cudgemm_copyC_in_streams(&plan[i]);
}
cudaDeviceSynchronize();
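For reference, the per-iteration work inside that loop is intended to follow this pattern (a sketch only - the field names on plan are hypothetical, but the cublasSetStream call is what routes the Dgemm into the right stream):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

//Sketch of one stream's worth of work: copies and the Dgemm are all
//enqueued on p->stream so different streams can overlap.
void issue_stream_work(cublasHandle_t handle, struct Plan *p,
                       double alpha, double beta)
{
    //all subsequent cuBLAS calls through this handle go to p->stream
    cublasSetStream(handle, p->stream);
    cudaMemcpyAsync(p->d_A, p->h_A, p->bytesA,
                    cudaMemcpyHostToDevice, p->stream);
    cudaMemcpyAsync(p->d_B, p->h_B, p->bytesB,
                    cudaMemcpyHostToDevice, p->stream);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                p->m, p->n, p->k, &alpha,
                p->d_A, p->lda, p->d_B, p->ldb, &beta,
                p->d_C, p->ldc);
    cudaMemcpyAsync(p->h_C, p->d_C, p->bytesC,
                    cudaMemcpyDeviceToHost, p->stream);
}
```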
So I am bouncing back and forth between streams, assigning work to each in turn, expecting a better execution time. Where am I going wrong?
Cross-posted to Stack Overflow: "cublas - Issues with CUDA streams".
As some Stack Overflow users suggested, I made some small modifications. When I time the above loop, I get a time of 0.000284 s, versus 1.703289 s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation, and the remaining memcpy). I suspect that, since I am not using any synchronization constructs, I may be printing the time before the computation actually finishes (I find it hard to believe the improvement is really that dramatic).
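That suspicion is consistent with how streams work: the launches and async copies return immediately, so a host timer stopped before a synchronization point measures only launch overhead, not the actual work. One way to time this correctly is with CUDA events, synchronizing on the stop event before reading the elapsed time (a sketch, not my actual timing code):

```cuda
#include <cuda_runtime.h>

//Sketch: event-based timing that waits for all queued GPU work to
//finish before the elapsed time is read.
float time_streamed_work(void)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // ... enqueue copies and kernels in all streams here ...
    cudaEventRecord(stop, 0);
    //block the host until everything recorded before 'stop' has completed
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Alternatively, a cudaDeviceSynchronize() before reading the host timer would serve the same purpose, at the cost of stalling the whole device.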