Stream Synchronization Questions

Hello,

I am new to CUDA and I’m struggling to understand how the queue of GPU tasks interacts with the host thread and when synchronization is necessary.

Say I have the following code snippet (it is uncompiled and I’m ignoring error checking; I’m sure there are errors, so please take it more as pseudocode):

// Compute some stuff for each file in the list
// InputParams holds some data that will be used by kernels to support operations on input data
void do_computations(std::vector<std::string> & input_files, std::vector<InputParams> & input_params)
{
    unsigned int input_size = 1000000;
    unsigned int bytes = sizeof(float) * input_size;
    int num_iters = input_files.size();

    float * h_input_data;
    float * d_input_data;
    float * d_results;
    float * h_results;

    // Pass the address of each pointer so the allocation can be written back into it
    cudaMallocHost((void**)&h_input_data, bytes);
    cudaMalloc((void**)&d_input_data, bytes);
    cudaMalloc((void**)&d_results, bytes);
    cudaMallocHost((void**)&h_results, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < num_iters; i++)
    {
        // Read input data from the i-th file and store it in pinned memory
        readInputData(input_files[i], h_input_data);
        cudaMemcpyAsync(d_input_data, h_input_data, bytes, cudaMemcpyHostToDevice, stream);
        someCalculation<<<1024,1,0,stream>>>(d_input_data, d_results, input_params[i]);
        // Destination first, then source, then the byte count
        cudaMemcpyAsync(h_results, d_results, bytes, cudaMemcpyDeviceToHost, stream);
    }
    // Do clean-up tasks: deallocate memory and destroy stream
}

In the code snippet, because I’m using asynchronous transfers and the kernel launch is non-blocking, the host thread will read some data, enqueue the GPU operations and move on to the next iteration. I am assuming that it’s possible for the host thread to get far enough ahead that it starts reading the next batch of input data before the previous batch has finished copying to the GPU, overwriting the data I intended to perform calculations on. Is this correct? If so, I know a stream synchronization call will fix this.

When the kernel call is enqueued, how is the dependence on the iteration variable, i, handled? What if the host thread is on i = 2 when the i = 1 someCalculation kernel is executed? Will the kernel actually use input_params[2] instead of input_params[1]? My guess is this does not happen.

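Yes, it’s possible. cudaMemcpyAsync from pinned memory returns control to the host before the transfer has completed, so on the second iteration of the for-loop the readInputData routine can overwrite data in h_input_data that has not yet been fully transferred to the device. Calling cudaStreamSynchronize(stream) before the buffer is reused prevents this.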
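The stream synchronization fix mentioned in the question might look roughly like the sketch below. It is only an illustration under some assumptions: the kernel body, its scale argument, the process_batches function and the fill loop (which stands in for readInputData) are all made up for this example, and error checking is omitted as in the original snippet.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for someCalculation: multiplies each element by scale.
__global__ void someCalculation(const float* in, float* out, unsigned int n, float scale)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * scale;
}

// Runs one batch per entry of fill_values and returns the first result
// element of every batch. The fill loop stands in for readInputData.
std::vector<float> process_batches(const std::vector<float>& fill_values, unsigned int n)
{
    assert(n > 0);
    size_t bytes = sizeof(float) * n;

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in, bytes);   // pinned host buffers
    cudaMallocHost((void**)&h_out, bytes);
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    std::vector<float> first_results;
    for (float v : fill_values)
    {
        // Safe to overwrite h_in here: the previous iteration ended with a
        // synchronization, so no copy out of h_in is still in flight.
        for (unsigned int k = 0; k < n; k++) h_in[k] = v;

        cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
        someCalculation<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n, 2.0f);
        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

        // Block the host until everything enqueued in the stream has finished.
        // This protects h_in from early reuse and makes h_out valid to read.
        cudaStreamSynchronize(stream);
        first_results.push_back(h_out[0]);
    }

    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return first_results;
}
```

Synchronizing once per iteration is the simplest option, but it also means the host cannot read the next file while the GPU is busy. If that overlap matters, alternating between two pinned input buffers is a common alternative, at the cost of more bookkeeping.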

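The kernel call is enqueued together with its parameters: kernel arguments are passed by value and copied at the time of the launch. The kernel call enqueued in iteration i = 1 will therefore use input_params[1], no matter how far ahead the host thread has run by the time the kernel actually executes.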
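A small experiment can make this visible. The sketch below is an illustration only: the one-field InputParams layout, the writeScale kernel and the launch_with_changing_params function are hypothetical and not taken from the question. The host overwrites its copy of the parameters right after every launch, yet each kernel still sees the value that was current when it was enqueued.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameter struct; the real InputParams is not shown in the question.
struct InputParams { float scale; };

// Each launch records the scale it received into its own output slot.
__global__ void writeScale(float* out, int slot, InputParams p)
{
    out[slot] = p.scale;
}

// Launches num_iters kernels, changing the host-side params after every launch,
// and returns what each kernel actually observed.
std::vector<float> launch_with_changing_params(int num_iters)
{
    assert(num_iters > 0);
    float* d_out;
    cudaMalloc((void**)&d_out, sizeof(float) * num_iters);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    InputParams p;
    for (int i = 0; i < num_iters; i++)
    {
        p.scale = static_cast<float>(i);
        // Both i and p are copied into the launch at this point.
        writeScale<<<1, 1, 0, stream>>>(d_out, i, p);
        // Changing the host copy afterwards does not affect the enqueued kernel.
        p.scale = -1.0f;
    }

    // Wait for all kernels, then copy the results back with a blocking copy.
    cudaStreamSynchronize(stream);
    std::vector<float> h_out(num_iters);
    cudaMemcpy(h_out.data(), d_out, sizeof(float) * num_iters, cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(d_out);
    return h_out;
}
```

Note that this by-value capture covers only the arguments themselves. If an argument is a pointer, as d_input_data is in the question, the pointer value is captured but the memory it points to is not, which is why the buffer contents still need the synchronization discussed above.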