Inferior Results on C2070 when Using Streams

I have recently ported an application to CUDA, and since most of the application's execution time was spent transferring data, I thought I would try to use streams to overlap the memory transfers with each other (and, less importantly, to overlap kernel execution with memory transfers). I took the simpleMultiCopy example from the SDK and modified it to work with my data. The modifications included performing two input memcpys and reusing one of the input memory areas as output on the GPU, as well as making sure the kernel wasn't running on non-existent data and that only data that had actually been processed was copied back to the host. I've included a sketch of the code below to show the order in which I issue commands:

// Read first input
memcpy(pinned_data_in0[current_stream], pagable_data_in0, stream_size);
memcpy(pinned_data_in1[current_stream], pagable_data_in1, stream_size);

// Upload first frame
cudaMemcpyAsync(d_data_in_out[current_stream],
    pinned_data_in0[current_stream], stream_size, cudaMemcpyHostToDevice,
    stream[current_stream]);
cudaMemcpyAsync(d_data_in[current_stream],
    pinned_data_in1[current_stream], stream_size, cudaMemcpyHostToDevice,
    stream[current_stream]);

for(unsigned int i=0; i<kernel_calls; i++) {
    next_stream = (current_stream + 1) % streams_used;

    if(i>=streams_used)
    {
        // Store the result, but only when the stream we are trying to
        // retrieve the result from has actually run once.
        memcpy(&(pagable_data_out[(i - streams_used)*
            (stream_size/sizeof(pagable_data_out[0]))]),
            pinned_data_out[current_stream], stream_size);
    }

    if(i+1<kernel_calls)
    {
        // Read new input, but only if the next stream still has data to
        // process.
        memcpy(pinned_data_in0[next_stream],
            &(pagable_data_in0[(i+1)*
            (stream_size/sizeof(pagable_data_in0[0]))]),
            stream_size);
        memcpy(pinned_data_in1[next_stream],
            &(pagable_data_in1[(i+1)*(stream_size/sizeof(pagable_data_in1[0]))]),
            stream_size);
    }

    // Ensure that processing and copying of the last cycle has finished.
    cudaEventSynchronize(cycleDone[next_stream]);

    // Process current frame
    myKernel<<<grid, block, 0, stream[current_stream]>>>(
        d_data_const, d_data_in[current_stream],
        d_data_in_out[current_stream], stream_size);

    if(i+1<kernel_calls)
    {
        // Upload next frame, but only if there is remaining data to
        // process.
        cudaMemcpyAsync(d_data_in_out[next_stream],
            pinned_data_in0[next_stream], stream_size,
            cudaMemcpyHostToDevice, stream[next_stream]);
        cudaMemcpyAsync(d_data_in[next_stream],
            pinned_data_in1[next_stream], stream_size, cudaMemcpyHostToDevice,
            stream[next_stream]);
    }

    // Download current frame
    cudaMemcpyAsync(pinned_data_out[current_stream],
        d_data_in_out[current_stream], stream_size, cudaMemcpyDeviceToHost,
        stream[current_stream]);

    cudaEventRecord(cycleDone[current_stream],
        stream[current_stream]);

    current_stream = next_stream;
}

// Flush the rest of the results from the pinned memory to the pagable memory.
for(unsigned int i=kernel_calls; i<kernel_calls+streams_used; i++)
{
    next_stream = (current_stream + 1) % streams_used;

    if(i>=streams_used)
    {
        // Ensure that processing and copying of the cycle has finished.
        cudaEventSynchronize(cycleDone[current_stream]);

        // Store the result, but only when the stream we are trying to
        // retrieve the result from has actually run once.
        memcpy(&(pagable_data_out[(i-streams_used)*
            (stream_size/sizeof(pagable_data_out[0]))]),
            pinned_data_out[current_stream], stream_size);
    }

    current_stream = next_stream;
}

The problem I'm having is that the streaming execution gives lower throughput than a non-streaming implementation using synchronous memory transfers, even when I make sure the kernels in both implementations launch the same number of threads. For example, when using 4 streams to process 128 MB of data (each kernel call processing 8 MB), the throughput was around 290 MB/s, whereas when I prepped and launched a single kernel to process 8 MB I got a throughput of 330 MB/s. Is there some way in which the order I issue commands to the streams is inefficient?

I’m using the 4.0 toolkit on a Tesla C2070 running on Windows 7 64-bit, compiling my code as win32/Release.

Yes. For best performance, you should queue the operations in the order of the SDK example.

That is: first the host-to-device memcpy()s for all streams, then the kernel launches for all streams, and finally the device-to-host memcpy()s for all streams.

Unfortunately the Programming Guide got it wrong: its example code uses the order you used (all operations for the first stream, then all operations for the second stream, and so on), which doesn't seem to provide a speedup with the current CUDA version.

A very useful resource on how to get the best performance out of CUDA streams is the following whitepaper by a colleague of mine. You can safely ignore that it deals with streams in the context of CUDA Fortran, as the operation-ordering approaches discussed map one-to-one to CUDA C.
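To make the ordering concrete, here is a minimal standalone sketch of the breadth-first issue order (the names, kernel, and chunk layout are illustrative, not taken from your code or from the whitepaper):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-chunk processing.
__global__ void processChunk(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Breadth-first ordering: loop over the streams once per operation type,
// instead of issuing copy/kernel/copy back-to-back for each stream.
void runBreadthFirst(float **h_in, float **h_out,   // pinned host chunks
                     float **d_in, float **d_out,   // device chunks
                     cudaStream_t *streams, int nStreams,
                     size_t chunkBytes, dim3 grid, dim3 block)
{
    for (int s = 0; s < nStreams; ++s)        // 1) all host-to-device copies
        cudaMemcpyAsync(d_in[s], h_in[s], chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);

    for (int s = 0; s < nStreams; ++s)        // 2) all kernel launches
        processChunk<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s],
                                                     chunkBytes / sizeof(float));

    for (int s = 0; s < nStreams; ++s)        // 3) all device-to-host copies
        cudaMemcpyAsync(h_out[s], d_out[s], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);

    cudaDeviceSynchronize();   // or per-stream events if the host must overlap too
}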

Thanks for the responses. I've been working on several things in parallel, so it's taken me a while to apply the suggested technique, but I think I've finally done it. Unfortunately, I am still measuring inferior results when using streams. Below is the basic strategy I am now taking:

///////////////////////////////////////////////////////////////////////////////
// Run all complete passes.
for(unsigned int pass=0; pass<full_passes; pass++)
{
    for(unsigned int stream=0; stream<STREAMS_USED; stream++)
    {
        // Wait until the previous pass has copied from the pinned memory
        // to the device to avoid overwriting data.
        CUDA_SAFE_CALL(cudaEventSynchronize(htodCopyDone[stream]));

        // Read the input for each stream into pinned memory.
        memcpy(pinned_data_in0[stream],
               &(pageable_data_in0[(STREAMS_USED*pass + stream)*
               (STREAM_BYTES/sizeof(pageable_data_in0[0]))]), STREAM_BYTES);
        memcpy(pinned_data_in1[stream],
               &(pageable_data_in1[(STREAMS_USED*pass + stream)*
               (STREAM_BYTES/sizeof(pageable_data_in1[0]))]), STREAM_BYTES);
    }

    for(unsigned int stream=0; stream<STREAMS_USED; stream++)
    {
        // Copy from pinned memory to device.
        CUDA_SAFE_CALL(cudaMemcpyAsync(d_data_in_out[stream],
            pinned_data_in0[stream], STREAM_BYTES, cudaMemcpyHostToDevice,
            streams[stream]));
        CUDA_SAFE_CALL(cudaMemcpyAsync(d_data_in[stream],
            pinned_data_in1[stream], STREAM_BYTES, cudaMemcpyHostToDevice,
            streams[stream]));

        // Mark the copy from pinned memory complete so the host knows it may
        // copy new values into the pinned memory.
        CUDA_SAFE_CALL(cudaEventRecord(htodCopyDone[stream],
            streams[stream]));
    }

    for(unsigned int stream=0; stream<STREAMS_USED; stream++)
    {
        // Execute kernels.
        myKernel<<<grid, block, 0, streams[stream]>>>(
            d_data_const, d_data_in[stream],
            d_data_in_out[stream], STREAM_BYTES);
    }

    // Execute only if previous passes have been run.
    if(pass>0)
    {
        for(unsigned int stream=0; stream<STREAMS_USED; stream++)
        {
            // Wait until the previous pass has copied to the pinned memory
            // to avoid copying incomplete results.
            CUDA_SAFE_CALL(cudaEventSynchronize(dtohCopyDone[stream]));

            // Copy the previous results from pinned memory to pageable memory.
            memcpy(&(pagable_data_out[(STREAMS_USED*(pass-1) + stream)*
                   (STREAM_BYTES/sizeof(pagable_data_out[0]))]),
                   pinned_data_out[stream], STREAM_BYTES);
        }
    }

    for(unsigned int stream=0; stream<STREAMS_USED; stream++)
    {
        // Copy from device to pinned memory.
        CUDA_SAFE_CALL(cudaMemcpyAsync(pinned_data_out[stream],
            d_data_in_out[stream], STREAM_BYTES, cudaMemcpyDeviceToHost,
            streams[stream]));

        // Mark the copy to pinned memory complete so the host knows it may
        // copy values out of the pinned memory.
        CUDA_SAFE_CALL(cudaEventRecord(dtohCopyDone[stream],
            streams[stream]));
    }
}

///////////////////////////////////////////////////////////////////////////////
// Run the final (incomplete) pass.
for(unsigned int stream=0; stream<full_streams; stream++)
{
    // Wait until the previous pass has copied from the pinned memory
    // to the device to avoid overwriting data.
    CUDA_SAFE_CALL(cudaEventSynchronize(htodCopyDone[stream]));

    // Read the input for each full stream into pinned memory.
    memcpy(pinned_data_in0[stream],
           &(pageable_data_in0[(STREAMS_USED*full_passes + stream)*
           (STREAM_BYTES/sizeof(pageable_data_in0[0]))]), STREAM_BYTES);
    memcpy(pinned_data_in1[stream],
           &(pageable_data_in1[(STREAMS_USED*full_passes + stream)*
           (STREAM_BYTES/sizeof(pageable_data_in1[0]))]), STREAM_BYTES);
}

if(partial_streams==1)
{
    // Handle the final (underoccupied) stream differently, since we don't
    // want to copy from host memory that isn't allocated.

    // Wait until the previous pass has copied from the pinned memory
    // to the device to avoid overwriting data.
    CUDA_SAFE_CALL(cudaEventSynchronize(htodCopyDone[full_streams]));

    // Read the remaining input into pinned memory.
    memcpy(pinned_data_in0[full_streams],
           &(pageable_data_in0[(STREAMS_USED*full_passes + full_streams)*
           (STREAM_BYTES/sizeof(pageable_data_in0[0]))]), final_stream_bytes);
    memcpy(pinned_data_in1[full_streams],
           &(pageable_data_in1[(STREAMS_USED*full_passes + full_streams)*
           (STREAM_BYTES/sizeof(pageable_data_in1[0]))]),
           final_stream_bytes);
}

// It is OK to run the final (underoccupied) stream the same way as the other
// streams for this operation, since all the pinned buffers are guaranteed to
// be allocated to the same size.
for(unsigned int stream=0; stream<full_streams+partial_streams; stream++)
{
    // Copy from pinned memory to device.
    CUDA_SAFE_CALL(cudaMemcpyAsync(d_data_in_out[stream],
        pinned_data_in0[stream], STREAM_BYTES, cudaMemcpyHostToDevice,
        streams[stream]));
    CUDA_SAFE_CALL(cudaMemcpyAsync(d_data_in[stream],
        pinned_data_in1[stream], STREAM_BYTES, cudaMemcpyHostToDevice,
        streams[stream]));

    // Mark the copy from pinned memory complete so the host knows it may
    // copy new values into the pinned memory.
    CUDA_SAFE_CALL(cudaEventRecord(htodCopyDone[stream],
        streams[stream]));
}

// It is OK to run the final (underoccupied) stream the same way as the other
// streams for this operation, since the extra data processed will simply not
// be copied into the final pageable output.
for(unsigned int stream=0; stream<full_streams+partial_streams; stream++)
{
    // Execute kernels.
    myKernel<<<grid, block, 0, streams[stream]>>>(
        d_data_const, d_data_in[stream],
        d_data_in_out[stream], STREAM_BYTES);
}

if(full_passes>0)
{
    // Drain a full pass, since we're copying the results of the previous
    // pass, which was full.
    for(unsigned int stream=0; stream<STREAMS_USED; stream++)
    {
        // Wait until the previous pass has copied to the pinned memory
        // to avoid copying incomplete results.
        CUDA_SAFE_CALL(cudaEventSynchronize(dtohCopyDone[stream]));

        // Copy the previous results from pinned memory to pageable memory.
        memcpy(&(pagable_data_out[(STREAMS_USED*(full_passes-1) + stream)*
               (STREAM_BYTES/sizeof(pagable_data_out[0]))]),
               pinned_data_out[stream], STREAM_BYTES);
    }
}

// It is OK to run the final (underoccupied) stream the same way as the other
// streams for this operation, since the extra data copied to pinned memory
// will simply not be copied into the final pageable output.
for(unsigned int stream=0; stream<full_streams+partial_streams; stream++)
{
    // Copy from device to pinned memory.
    CUDA_SAFE_CALL(cudaMemcpyAsync(pinned_data_out[stream],
        d_data_in_out[stream], STREAM_BYTES, cudaMemcpyDeviceToHost,
        streams[stream]));

    // Mark the copy to pinned memory complete so the host knows it may
    // copy values out of the pinned memory.
    CUDA_SAFE_CALL(cudaEventRecord(dtohCopyDone[stream],
        streams[stream]));
}

// Copy the pinned data from the final (incomplete) pass back to pageable memory.
for(unsigned int stream=0; stream<full_streams; stream++)
{
    // Wait until the copy to pinned memory has finished
    // to avoid copying incomplete results.
    CUDA_SAFE_CALL(cudaEventSynchronize(dtohCopyDone[stream]));

    // Copy the results from pinned memory to pageable memory.
    memcpy(&(pagable_data_out[(STREAMS_USED*full_passes + stream)*
           (STREAM_BYTES/sizeof(pagable_data_out[0]))]),
           pinned_data_out[stream], STREAM_BYTES);
}

if(partial_streams==1)
{
    // Handle the final (underoccupied) stream differently, since we don't
    // want to copy to host memory that isn't allocated.

    // Wait until the copy to pinned memory has finished
    // to avoid copying incomplete results.
    CUDA_SAFE_CALL(cudaEventSynchronize(dtohCopyDone[full_streams]));

    // Copy the results from pinned memory to pageable memory.
    memcpy(&(pagable_data_out[(STREAMS_USED*full_passes + full_streams)*
           (STREAM_BYTES/sizeof(pagable_data_out[0]))]),
           pinned_data_out[full_streams], final_stream_bytes);
}

I'm wondering if the problems I'm having may be due to the use of memcpys on the host. The only SDK example I've found so far that uses them is simpleMultiCopy, but that example queues commands in a depth-first fashion (almost identical to how I was originally using the streams), as opposed to the breadth-first method you suggested and which I have tried to implement. The simpleStreams example from the SDK ignores any transfer of data between pageable and pinned memory, which is fine for a performance demonstration but naive if one actually wants to process data. Even though simpleMultiCopy gets less of a performance gain when simulating I/O (i.e., when it uses memcpys to shuttle data in and out of pinned memory), it still shows a gain rather than the loss I'm measuring. Perhaps my use of event synchronizations is less than optimal right now? Any insight would be appreciated.
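One way to test whether the host-side memcpys are the culprit would be to page-lock the existing buffers with cudaHostRegister() (new in the 4.0 toolkit), so that cudaMemcpyAsync() can use them directly and the staging copy into a separate pinned buffer disappears. A minimal, untested sketch with illustrative names and sizes:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t total_bytes = 128 * 1024 * 1024;   // e.g. the 128 MB data set
    unsigned char *host_in  = (unsigned char*)malloc(total_bytes);
    unsigned char *host_out = (unsigned char*)malloc(total_bytes);

    // Page-lock the existing (formerly pageable) allocations. Check the return
    // codes, since registration can fail (e.g. alignment or resource limits).
    if (cudaHostRegister(host_in,  total_bytes, cudaHostRegisterDefault) != cudaSuccess ||
        cudaHostRegister(host_out, total_bytes, cudaHostRegisterDefault) != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed\n");
        return 1;
    }

    unsigned char *d_buf;
    cudaStream_t stream;
    cudaMalloc(&d_buf, total_bytes);
    cudaStreamCreate(&stream);

    // The registered memory can now be used directly in asynchronous copies,
    // so no memcpy() into a separate pinned staging buffer is needed.
    cudaMemcpyAsync(d_buf, host_in, total_bytes, cudaMemcpyHostToDevice, stream);
    // ... kernel launch(es) in 'stream' would go here ...
    cudaMemcpyAsync(host_out, d_buf, total_bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaHostUnregister(host_in);
    cudaHostUnregister(host_out);
    free(host_in);
    free(host_out);
    return 0;
}

If throughput recovers once the staging copies are removed, that would point at the host-side memcpys rather than the event synchronization.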