I have implemented the simpleStreams SDK example using my expression-templates-based library.
The code essentially amounts to assigning 5 to all the elements of a GPU array and, optionally, transferring the result to a CPU array. The code covers the following test cases (they essentially trace the SDK example):
CPU and GPU array declarations
Matrix<int> h_a_matrix(1,n,PINNED); // uses pinned memory
CudaMatrix<int> d_a_matrix(1,n);
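For reference, assuming PINNED maps to page-locked host memory, the two declarations above presumably amount to something like the following raw CUDA allocations (h_a and d_a are illustrative names, not the library's internals):
int *h_a, *d_a;
cudaHostAlloc(&h_a, n * sizeof(int), cudaHostAllocDefault); // pinned host buffer, required for truly asynchronous copies
cudaMalloc(&d_a, n * sizeof(int));                          // device buffer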
Assignment WITHOUT GPU->CPU memory transfers
streams.InitStreams(nstreams); // uses cudaStreamCreate to create nstreams streams
for(int k = 0; k < nreps; k++)
{
// asynchronously launch nstreams kernels, each operating on its own
// portion of data
for(int i = 0; i < nstreams; i++)
{
streams.SetStream(i); // sets the active stream to the i-th stream
// assignment of the elements from i*n/nstreams to (i+1)*n/nstreams-1
d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
}
}
streams.SynchronizeAll(); // cudaStreamSynchronize
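In plain CUDA terms, the streamed assignment above presumably boils down to one asynchronous kernel launch per stream, each covering a chunk of n/nstreams elements. A minimal sketch follows; init_kernel, BLOCKSIZE and the stream[] array are illustrative assumptions, not the code actually generated by the expression templates:
__global__ void init_kernel(int* d_a, int value, int offset, int count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count) d_a[offset + tid] = value; // each stream writes only its own chunk
}

// host side: one asynchronous kernel launch per stream
int chunk = n / nstreams;
for (int i = 0; i < nstreams; i++)
    init_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a, 5, i * chunk, chunk);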
Assignment WITH GPU->CPU memory transfers - approach #1
streams.InitStreams(nstreams);
timer5.StartCounter();
for(int k = 0; k < nreps; k++)
{
// asynchronously launch nstreams kernels, each operating on its own portion of data
for(int i = 0; i < nstreams; i++)
{
streams.SetStream(i);
d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
h_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1));
}
}
streams.SynchronizeAll();
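In raw CUDA terms, the inner loop of approach #1 issues, per stream, the kernel immediately followed by an asynchronous device-to-host copy of the same chunk (reusing the illustrative init_kernel, h_a, d_a and stream[] from the sketches above):
int chunk = n / nstreams;
for (int i = 0; i < nstreams; i++)
{
    init_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a, 5, i * chunk, chunk);
    // queued in the same stream, so it starts only after that stream's kernel has finished,
    // but it may overlap with the kernels and copies issued in the other streams
    cudaMemcpyAsync(h_a + i * chunk, d_a + i * chunk, chunk * sizeof(int), cudaMemcpyDeviceToHost, stream[i]);
}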
Assignment WITH GPU->CPU memory transfers - approach #2
streams.InitStreams(nstreams);
for(int k = 0; k < nreps; k++)
{
// asynchronously launch nstreams kernels, each operating on its own portion of data
for(int i = 0; i < nstreams; i++)
{
streams.SetStream(i);
d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
}
// asynchronously launch nstreams memcopies. Note that memcopy in stream x will only
// commence executing when all previous CUDA calls in stream x have completed
for(int i = 0; i < nstreams; i++) {
streams.SetStream(i);
h_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1));
}
}
streams.SynchronizeAll();
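Approach #2 differs only in the issue order: all the kernels are queued first and all the copies afterwards, each still in its own stream (same illustrative names as above):
int chunk = n / nstreams;
for (int i = 0; i < nstreams; i++)
    init_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a, 5, i * chunk, chunk);
for (int i = 0; i < nstreams; i++)
    // stream ordering still guarantees each copy waits for the kernel previously queued in the same stream
    cudaMemcpyAsync(h_a + i * chunk, d_a + i * chunk, chunk * sizeof(int), cudaMemcpyDeviceToHost, stream[i]);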
To comply with the SDK example, I’m using the following grid configuration to launch the assignment kernel:
dim3 dimGrid(iDivUp(NumElements,dimBlock.x*streams.GetNumStreams()));
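Here iDivUp is the usual ceiling-division helper of the SDK samples and dimBlock.x equals BLOCKSIZE, so each of the nstreams per-stream launches covers roughly NumElements/nstreams elements; for completeness:
// round-up integer division: number of blocks needed to cover 'a' elements with blocks of 'b' threads
int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

dim3 dimBlock(BLOCKSIZE); // BLOCKSIZE = 512 in the measurements below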
The timing (taken by using CUDA events) is the same for the three cases and is as follows:
GeForce GT540M - BLOCKSIZE = 512
time in ms: 60.39 [1 stream], 30.32 [2 streams], 15.46 [4 streams], 8.08 [8 streams], 4.76 [16 streams], 3.47 [32 streams], 4.24 [64 streams], 4.5 [128 streams]
Kepler K20c - BLOCKSIZE = 512
time in ms: 9.56 [1 stream], 4.82 [2 streams], 2.46 [4 streams], 1.39 [8 streams], 0.96 [16 streams], 3.47 [32 streams], 1.82 [64 streams], 1.82 [128 streams]
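(For reference, the figures above come from CUDA events; assuming the timer used in the snippets wraps cudaEventRecord/cudaEventElapsedTime, the timing pattern is roughly:)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// ... queue the per-stream kernels and copies of one of the loops above ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // block until all queued work has completed
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);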
What I observe is the following:
- The times, on both architectures, approximately halve when the number of streams is doubled, until saturation sets in; the saturation, however, occurs earlier on the K20c;
- On both architectures, the memory transfers are completely hidden by the computations;
- On both architectures, there is a benefit in using streams even when no GPU->CPU memory transfer is required (computation only).
Provided that my conclusions are correct, I then have three questions:
1. By which mechanism do streams help even when no GPU->CPU memory transfer is performed? Is the card overlapping computations with transfers to global memory? On the K20c, I have observed that I do not get the same effect by simply using larger thread blocks.
2. Why does the saturation occur earlier on the K20c?
3. How can I visualize the overlaps occurring across the streams? The Visual Profiler shipped with CUDA 5.0 seems to serialize the streams (see the last answer to the "CUDA streams not overlapping" post: http://stackoverflow.com/questions/6070392/cuda-streams-not-overlapping).
Thank you very much for any help.