CUDA stream performance

I have implemented the simpleStreams SDK example using my expression-templates-based library.

The code essentially amounts to assigning 5 to all the elements of a GPU array and, optionally, transferring the result to a CPU array. It covers the following test cases (they essentially mirror the SDK example):

CPU and GPU array declarations

Matrix<int>             h_a_matrix(1,n,PINNED); // host array, uses pinned memory
CudaMatrix<int>         d_a_matrix(1,n);        // device array
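
For reference, here is a minimal sketch (an assumption about the library internals, not its actual code) of the raw CUDA runtime allocations these two declarations correspond to; the pinned host memory is what makes the asynchronous copies below possible:

int *h_a = nullptr; // raw host pointer, assumed to live inside Matrix<int>
int *d_a = nullptr; // raw device pointer, assumed to live inside CudaMatrix<int>
cudaMallocHost(&h_a, n * sizeof(int)); // page-locked (pinned) host memory, required for cudaMemcpyAsync
cudaMalloc(&d_a, n * sizeof(int));     // device memory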

Assignment WITHOUT GPU->CPU memory transfers

streams.InitStreams(nstreams); // uses cudaStreamCreate to create nstreams streams
for(int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own 
    // portion of data
    for(int i = 0; i < nstreams; i++)
    { 
        streams.SetStream(i); // sets the active stream to the i-th stream
        // assignment of the elements from i*n/nstreams to (i+1)*n/nstreams-1
        d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
    }
}
streams.SynchronizeAll(); // cudaStreamSynchronize
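
For clarity, here is a plain-CUDA sketch of what the loop above presumably boils down to. It is a sketch only: assign_kernel, d_a, stream[] and BLOCKSIZE are assumed names standing in for the library's generated kernel, the raw device pointer, the stream array created by InitStreams, and the block size of 512.

#include <cuda_runtime.h>

// assumed element-wise assignment kernel (the library generates an equivalent one)
__global__ void assign_kernel(int *data, int value, int numel)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < numel) data[tid] = value;
}

void assign_no_transfer(int *d_a, cudaStream_t *stream, int n, int nstreams, int nreps, int BLOCKSIZE)
{
    for (int k = 0; k < nreps; k++)
        for (int i = 0; i < nstreams; i++)
        {
            int offset = i * n / nstreams;                 // first element of stream i's chunk
            int chunk  = (i + 1) * n / nstreams - offset;  // chunk size
            assign_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a + offset, 5, chunk);
        }
    for (int i = 0; i < nstreams; i++)
        cudaStreamSynchronize(stream[i]);                  // wait for all streams
}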

Assignment WITH GPU->CPU memory transfers - approach #1

streams.InitStreams(nstreams);
timer5.StartCounter();
for(int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own portion of data
    for(int i = 0; i < nstreams; i++)
    { 
        streams.SetStream(i);
        d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
        h_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1))=d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1));
    }
}
streams.SynchronizeAll();
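
In the same plain-CUDA sketch notation as above (reusing the assumed assign_kernel, the raw d_a/h_a pointers, stream[] and BLOCKSIZE), approach #1 interleaves each chunk's kernel with its device-to-host copy on the same stream, so the copy for chunk i can overlap with the kernel for chunk i+1 issued on a different stream:

for (int k = 0; k < nreps; k++)
    for (int i = 0; i < nstreams; i++)
    {
        int offset = i * n / nstreams;
        int chunk  = (i + 1) * n / nstreams - offset;
        assign_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a + offset, 5, chunk);
        // queued on the same stream, so it starts only after the kernel above,
        // but can overlap with kernels/copies issued on the other streams
        cudaMemcpyAsync(h_a + offset, d_a + offset, chunk * sizeof(int), cudaMemcpyDeviceToHost, stream[i]);
    }
for (int i = 0; i < nstreams; i++)
    cudaStreamSynchronize(stream[i]);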

Assignment WITH GPU->CPU memory transfers - approach #2

streams.InitStreams(nstreams);
for(int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own portion of data
    for(int i = 0; i < nstreams; i++)
    { 
        streams.SetStream(i);
        d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1)) = 5;
    }
    // asynchronously launch nstreams memcopies.  Note that memcopy in stream x will only
    //   commence executing when all previous CUDA calls in stream x have completed
    for(int i = 0; i < nstreams; i++) {
        streams.SetStream(i);
        h_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1))=d_a_matrix(Range(i*n/nstreams,(i+1)*n/nstreams-1));
    }
}
streams.SynchronizeAll();
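
The sketch for approach #2 (same assumed names as above) differs only in issue order: all kernels first, then all copies, each copy still queued on its producing kernel's stream so that per-stream ordering makes copy i wait for kernel i only:

for (int k = 0; k < nreps; k++)
{
    for (int i = 0; i < nstreams; i++)   // first pass: issue all kernels
    {
        int offset = i * n / nstreams;
        int chunk  = (i + 1) * n / nstreams - offset;
        assign_kernel<<<(chunk + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, stream[i]>>>(d_a + offset, 5, chunk);
    }
    for (int i = 0; i < nstreams; i++)   // second pass: issue all D->H copies
    {
        int offset = i * n / nstreams;
        int chunk  = (i + 1) * n / nstreams - offset;
        cudaMemcpyAsync(h_a + offset, d_a + offset, chunk * sizeof(int), cudaMemcpyDeviceToHost, stream[i]);
    }
}
for (int i = 0; i < nstreams; i++)
    cudaStreamSynchronize(stream[i]);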

To match the SDK example, I'm using the following grid to launch the assignment kernel:

dim3 dimGrid(iDivUp(NumElements,dimBlock.x*streams.GetNumStreams()));
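
where iDivUp is the usual rounding-up integer division helper from the SDK samples:

// integer division rounded up (as commonly defined in the CUDA SDK samples)
inline int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }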

The timings (measured with CUDA events) are the same for the three cases and are as follows:

GeForce GT540M - BLOCKSIZE = 512
time in ms: 60.39 [1 stream], 30.32 [2 streams], 15.46 [4 streams], 8.08 [8 streams], 4.76 [16 streams], 3.47 [32 streams], 4.24 [64 streams], 4.5 [128 streams]

Kepler K20c - BLOCKSIZE = 512
time in ms: 9.56 [1 stream], 4.82 [2 streams], 2.46 [4 streams], 1.39 [8 streams], 0.96 [16 streams], 3.47 [32 streams], 1.82 [64 streams], 1.82 [128 streams]

What I observe is the following:

  1. The times, for both architectures, approximately halve when doubling the number of streams until saturation occurs; however, saturation for the K20c occurs earlier;
  2. The memory transfers, for both architectures, are completely hidden by the computations;
  3. For both architectures, there is a benefit in using streams even when no GPU->CPU memory transfer is required (computation only).

Provided that my conclusions are correct, I then have three questions:

  1. By which mechanism do streams help even when no GPU->CPU memory transfer is involved? Is the card overlapping computation with transfers to global memory? On the K20c, I have observed that I do not get the same effect when using larger thread blocks.

  2. Why does saturation occur earlier on the K20c?

  3. How can I visualize the overlap occurring with streams? It seems that the Visual Profiler provided with CUDA 5.0 serializes the streams (see the last answer to the "CUDA streams not overlapping" post - http://stackoverflow.com/questions/6070392/cuda-streams-not-overlapping).

Thank you very much for any help.

If you are using the command-line profiler, kernels are serialized by default to match the legacy behavior. The conckerneltrace option must be set in the profiler configuration file (cuda_prof_conf) to allow kernels to overlap with each other.
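
For example (the environment-variable and option names below are best-effort assumptions and should be checked against the Compute Command Line Profiler documentation), the legacy profiler setup looks roughly like this:

# enable the legacy command-line profiler and point it at a config file (assumed variable names)
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
export COMPUTE_PROFILE_CONFIG=cuda_prof_conf

# cuda_prof_conf contents: concurrent-kernel tracing plus the timestamp/stream
# columns that a timeline viewer needs in order to show overlap
conckerneltrace
gpustarttimestamp
gpuendtimestamp
streamid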

Thank you very much for your answer.

I have used the conckerneltrace option in the command-line profiler's configuration file, but the resulting CSV log file is rather hard to read; see:

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GT 540M
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 130486c3614f94f4
method,gputime,cputime,occupancy
_Z17evaluation_matrixIN16LibraryNameSpace10CudaScalarINS0_8double2_EEES2_S2_EvPT0_NS0_8CudaExprIT_T1_EEi,25.600,16.934,1.000
memcpyHtoDasync,27.424,9.237
memcpyHtoDasync,50.432,3.079
_ZN16LibraryNameSpace25zero_padding_NUFFT_NER_1DEPNS_8double2_ES1_i,13.312,15.907,0.333
_ZN12dpRadix0025B10kernel1MemIL14fftDirection_tn1EEEvP7ComplexIdEPKS3_jj9divisor_tS7_S7_15coordDivisors_t5CoordIjESA_jjd,10.240,9.750,0.208
_ZN12dpRadix0064B10kernel1MemIL14fftDirection_tn1EEEvP7ComplexIdEPKS3_jj9divisor_tS7_S7_15coordDivisors_t5CoordIjESA_jjd,21.248,6.671,0.167
_ZN16LibraryNameSpace13interpolationEPNS_8double2_EPKdS1_ii,1727.488,9.750,0.333
_Z17evaluation_matrixIN16LibraryNameSpace11CudaBinExprIPKNS0_8double2_ES4_NS0_9CudaOpSumES2_EES2_S2_EvPT0_NS0_8CudaExprIT_T1_EEi,60.160,8.723,1.000
memcpyHtoDasync,7.936,4.618
memcpyHtoDasync,34.560,2.566
...

Is there a way to import it into the Visual Profiler to improve readability? Is the command-line profiler the only way to see kernels overlap?
Thank you again as usual.

You can import the trace into nvvp via File/Import.

Thank you very much. I can now clearly see the overlap between computation and memory transfers (see attached file).


If you use CUDA 5.0 or later, nvvp will not force serialization of the kernels (unless you explicitly disable “concurrent kernel” in the session properties). You should also start using nvprof as your command-line profiler, since it is much more full-featured and accurate than the “legacy” command-line profiler. nvprof is described in the Profiler User’s Guide.
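
For example (a sketch; the binary name is a placeholder and the exact options are documented in the Profiler User’s Guide), nvprof can print a per-launch GPU trace or write a timeline that nvvp can open via File/Import:

# one line per kernel launch / memcpy, with timestamps and stream IDs
nvprof --print-gpu-trace ./simpleStreams_test

# write a timeline file, then open it in nvvp with File/Import
nvprof -o streams_timeline.nvprof ./simpleStreams_test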