Stream Synchronization Questions

aarsmith54 · January 17, 2019, 3:35am

Hello,

I am new to CUDA and I’m struggling to understand how the queue of GPU tasks interacts with the host thread and when synchronization is necessary.

Say I have the following code snippet (it is uncompiled and I’m ignoring error checking; I’m sure there are errors, so please treat it as pseudocode):

// Compute some stuff for each file in the list
// InputParams holds some data that will be used by kernels to support operations on input data
void do_computations(std::vector<std::string> & input_files, std::vector<InputParams> & input_params)
{
    unsigned int input_size = 1000000;
    unsigned int bytes = sizeof(float) * input_size;
    int num_iters = input_files.size();

    float * h_input_data;
    float * d_input_data;
    float * d_results;
    float * h_results;

    // Each allocation call takes the address of the pointer it should fill in
    cudaMallocHost((void**)&h_input_data, bytes);
    cudaMalloc((void**)&d_input_data, bytes);
    cudaMalloc((void**)&d_results, bytes);
    cudaMallocHost((void**)&h_results, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < num_iters; i++)
    {
        // Read input data from the i-th file and store it in pinned memory
        readInputData(input_files[i], h_input_data);
        cudaMemcpyAsync(d_input_data, h_input_data, bytes, cudaMemcpyHostToDevice, stream);
        someCalculation<<<1024,1,0,stream>>>(d_input_data, d_results, input_params[i]);
        // Destination comes first: copy the results from the device back to pinned host memory
        cudaMemcpyAsync(h_results, d_results, bytes, cudaMemcpyDeviceToHost, stream);
    }
    // Do clean-up tasks: deallocate memory and destroy stream
}

In the code snippet, because I’m using asynchronous transfers and the kernel launch is non-blocking, the host thread will read some data, enqueue the GPU operations, and move on to the next iteration. I am assuming it’s possible for the host thread to get far enough ahead that it starts reading the next batch of input data before the previous batch has finished copying to the GPU, overwriting the data I intended to perform calculations on. Is this correct? If so, I know a stream synchronization call will fix this.

When the kernel call is enqueued how is the dependence on the iteration variable, i, handled? What if the host thread is on i = 2, when the i = 1 someCalculation kernel is executed? Will the kernel actually use input_params[2] instead of input_params[1]? My guess is this does not happen.

Robert_Crovella · January 17, 2019, 4:03am

Yes, it’s possible that on the second iteration of the for-loop, the readInputData routine will overwrite data in h_input_data that has not yet been fully transferred to the device.
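One way to avoid the overwrite is sketched below. The simplest fix is a cudaStreamSynchronize(stream) at the end of every iteration, but that stops the host from preparing the next batch while the GPU works. The sketch instead alternates between two pinned host buffers and records an event per buffer, so the host only waits when it is about to reuse a buffer. Everything here is an assumption made for illustration: scaleKernel stands in for someCalculation, fillInput stands in for readInputData, and the CHECK macro, runPipeline, and the buffer sizes are hypothetical.

```cuda
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 if a CUDA call fails.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for someCalculation: out[k] = in[k] * scale.
__global__ void scaleKernel(const float* in, float* out, int n, float scale)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) out[k] = in[k] * scale;
}

// Hypothetical stand-in for readInputData: fills the buffer with one value.
static void fillInput(float* buf, int n, float value)
{
    for (int k = 0; k < n; ++k) buf[k] = value;
}

// Counts the elements of buf that differ from the expected value.
static int countMismatches(const float* buf, int n, float expected)
{
    int bad = 0;
    for (int k = 0; k < n; ++k) if (buf[k] != expected) ++bad;
    return bad;
}

// Processes num_iters batches. Returns the number of wrong results,
// or -1 if a CUDA call failed.
int runPipeline(int num_iters)
{
    const int n = 1 << 20;
    const size_t bytes = sizeof(float) * n;
    const float scale = 2.0f;

    float *h_in[2], *h_out[2], *d_in, *d_out;
    cudaEvent_t done[2];
    cudaStream_t stream;
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMallocHost((void**)&h_in[b], bytes));
        CHECK(cudaMallocHost((void**)&h_out[b], bytes));
        CHECK(cudaEventCreate(&done[b]));
    }
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    CHECK(cudaStreamCreate(&stream));

    int mismatches = 0;
    for (int i = 0; i < num_iters; ++i) {
        int b = i % 2;
        if (i >= 2) {
            // Buffer b was last used by batch i-2: wait for that batch only.
            CHECK(cudaEventSynchronize(done[b]));
            mismatches += countMismatches(h_out[b], n, scale * (i - 2));
        }
        fillInput(h_in[b], n, (float)i);
        // Operations in one stream run in issue order, so a single pair of
        // device buffers is enough; only the host buffers are doubled.
        CHECK(cudaMemcpyAsync(d_in, h_in[b], bytes, cudaMemcpyHostToDevice, stream));
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n, scale);
        CHECK(cudaMemcpyAsync(h_out[b], d_out, bytes, cudaMemcpyDeviceToHost, stream));
        CHECK(cudaEventRecord(done[b], stream));
    }

    // Drain the stream, then check the batches still sitting in the buffers.
    CHECK(cudaStreamSynchronize(stream));
    for (int i = (num_iters > 2 ? num_iters - 2 : 0); i < num_iters; ++i)
        mismatches += countMismatches(h_out[i % 2], n, scale * i);

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_in[b]); cudaFreeHost(h_out[b]); cudaEventDestroy(done[b]);
    }
    cudaFree(d_in); cudaFree(d_out); cudaStreamDestroy(stream);
    return mismatches;
}
```

Calling runPipeline(4) on a machine with a working CUDA device is expected to return 0. The same event wait also protects the result buffer, which in the original loop could be overwritten by a later batch before the host had looked at it.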

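The kernel call is enqueued with its parameters, which are copied at launch time. The kernel call enqueued in the iteration i=1 will therefore use input_params[1], no matter how far ahead the host thread has run by the time the kernel executes.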
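A small experiment can make this visible. The sketch below launches a kernel with a by-value struct argument and then changes both the struct and the index on the host before the result is read back. The names Params, recordScale, and launchThenModify are hypothetical, with Params standing in for InputParams.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for InputParams.
struct Params { float scale; };

// Writes the scale it was launched with to device memory.
__global__ void recordScale(float* out, Params p)
{
    *out = p.scale;
}

// Returns the scale the kernel actually saw, or -1.0f if a CUDA call failed.
float launchThenModify()
{
    std::vector<Params> params = { {1.0f}, {2.0f} };
    float* d_out = nullptr;
    cudaStream_t stream;
    if (cudaMalloc((void**)&d_out, sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0f;

    int i = 0;
    // params[i] is passed by value, so it is copied as part of the launch.
    recordScale<<<1, 1, 0, stream>>>(d_out, params[i]);

    // The host races ahead: it overwrites the element and advances the index.
    params[i].scale = 99.0f;
    i = 1;

    float h_out = 0.0f;
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? h_out : -1.0f;
}
```

On a working CUDA device launchThenModify() returns 1.0f, the value the element held at the moment of the launch. Note that this applies only to the argument values themselves: if a kernel is handed a pointer, the memory it points to is not captured, which is why the h_input_data buffer in the question can still be overwritten.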
