I’m working on an iterative algorithm, and since there is no way to do a
global sync within a kernel, I iterate over clEnqueueNDRangeKernel(…)
kernel calls:
int first_input = 1;
for (int x = 0; x < iterations; x++)
{
    err = clEnqueueNDRangeKernel(cmd_queue, kernels[kernel_id], 2, NULL,
                                 global_work_size, local_work_size, 0, NULL, &kernel_event);
    clFinish(cmd_queue);

    /* toggle the flag and set it as the argument for the NEXT enqueue */
    first_input = (first_input == 1) ? 0 : 1;
    err |= clSetKernelArg(kernels[kernel_id], 7, sizeof(int), &first_input);
}
...
if(iterations%2)
err = clEnqueueReadBuffer(cmd_queue, mem1, CL_TRUE, 0, buff_size_padded, result, 0, NULL, NULL);
else
err = clEnqueueReadBuffer(cmd_queue, mem2, CL_TRUE, 0, buff_size_padded, result, 0, NULL, NULL);
...
The first_input variable is used to swap the matrices used by the kernel, e.g.:
iter 1: mat1 = input, mat2 = output
iter 2: mat1 = output, mat2 = input
iter 3: mat1 = input, mat2 = output
…
After some testing I found that when I remove clFinish from the code above, the queued commands
only seem to be executed when the data is read back (regardless of whether I use a blocking or non-blocking read).
This has no significant effect on the total execution time, and the results are always the same (as far as I’ve tested).
Since execution is in-order by default, I assume that calling clEnqueueReadBuffer has the same effect as clFinish (plus reading the data).
But is it guaranteed that each kernel execution uses the correct kernel arguments without clFinish?
e.g.
iter 1: mat1 = input, mat2 = output, first_input = 1;
iter 2: mat1 = output, mat2 = input, first_input = 0;
iter 3: mat1 = input, mat2 = output, first_input = 1;