Hello,
I am new to CUDA and I’m struggling to understand how the queue of GPU tasks interacts with the host thread and when synchronization is necessary.
Say I have the following code snippet. (This is uncompiled and I’m ignoring any error checking; I’m sure there are mistakes, so please treat it more as pseudocode.)
// Compute some stuff for each file in the list
// InputParams holds some data that will be used by kernels to support operations on input data
void do_computations(std::vector<std::string> & input_files, std::vector<InputParams> & input_params)
{
    unsigned int input_size = 1000000;
    unsigned int bytes = sizeof(float) * input_size;
    int num_iters = input_files.size();
    float * h_input_data;
    float * d_input_data;
    float * d_results;
    float * h_results;
    // The allocation calls take the address of each pointer
    cudaMallocHost((void **)&h_input_data, bytes);
    cudaMalloc((void **)&d_input_data, bytes);
    cudaMalloc((void **)&d_results, bytes);
    cudaMallocHost((void **)&h_results, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < num_iters; i++)
    {
        // Read input data from the i-th file and store it in pinned memory
        readInputData(input_files[i], h_input_data);
        cudaMemcpyAsync(d_input_data, h_input_data, bytes, cudaMemcpyHostToDevice, stream);
        someCalculation<<<1024, 1, 0, stream>>>(d_input_data, d_results, input_params[i]);
        // Copy the results from the device back to pinned host memory
        cudaMemcpyAsync(h_results, d_results, bytes, cudaMemcpyDeviceToHost, stream);
    }
    // Do clean-up tasks: deallocate memory and destroy stream
}
In the code snippet, because I’m using asynchronous transfers and the kernel launch is non-blocking, the host thread will read some data, enqueue the GPU operations, and move on to the next iteration. I am assuming it’s possible for the host thread to get far enough ahead that I could start reading the next batch of input data before the current input data has finished copying to the GPU, overwriting the data I intended to perform calculations on. Is this correct? If so, I know a stream synchronization call will fix this.
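If that’s right, here is roughly where I’d put the synchronization, keeping the same variable names as in the snippet above (I realize this fully serializes each iteration, and that double-buffering the pinned memory would be needed to actually overlap transfers with compute):

for (int i = 0; i < num_iters; i++)
{
    readInputData(input_files[i], h_input_data);
    cudaMemcpyAsync(d_input_data, h_input_data, bytes, cudaMemcpyHostToDevice, stream);
    someCalculation<<<1024, 1, 0, stream>>>(d_input_data, d_results, input_params[i]);
    cudaMemcpyAsync(h_results, d_results, bytes, cudaMemcpyDeviceToHost, stream);
    // Block the host until all work queued on the stream has finished,
    // so the next iteration cannot overwrite h_input_data (or h_results)
    // while the GPU is still using them.
    cudaStreamSynchronize(stream);
}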
When the kernel call is enqueued, how is the dependence on the iteration variable i handled? What if the host thread is already on i = 2 when the i = 1 someCalculation kernel actually executes? Will the kernel use input_params[2] instead of input_params[1]? My guess is that it will not.
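To make the question concrete, here is the kind of behavior I’m asking about, sketched with a toy kernel (the names toyKernel and d_out are hypothetical, just for illustration):

__global__ void toyKernel(int value, int * d_out)
{
    // 'value' is a by-value kernel argument. My question is whether it
    // holds whatever it held when the launch was enqueued, or whatever
    // the host variable holds when the kernel eventually runs.
    d_out[0] = value;
}

// On the host:
for (int i = 0; i < 3; i++)
{
    // Is 'i' evaluated here, at the point the launch is enqueued,
    // or later, when the kernel is actually executed on the GPU?
    toyKernel<<<1, 1, 0, stream>>>(i, d_out);
}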