Weird behavior from non-blocking clEnqueueWriteBuffer

I’m using an in-order-execution command queue with non-blocking buffer reads and writes. According to the OpenCL documentation, this should be fine: as long as I enqueue the commands in the right order, the queue will execute them one after another in the order they were enqueued. However, I sometimes get wrong output, though most of the time the output is correct. When I switch to a blocking write with a non-blocking read, the output is always correct. Does anyone know why that is?

I’m not sure if it’s related, but in my code I never release the cl_program, cl_kernel, cl_command_queue, or cl_context, though I do release the cl_mem objects. If you’re wondering why: I’m writing a compiler that generates OpenCL code, so it’s not straightforward to figure out where the right place to release things is. I know releasing is crucial, and I will fix that as a first priority.
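(When I do get to it, my understanding is that teardown is essentially creation in reverse, something like the sketch below, where `program` is a hypothetical name for the cl_program behind `clkern`:)

```c
/* Sketch: release OpenCL objects in roughly the reverse order of
 * creation, after the queue has drained. */
clFinish(queue);                 /* no enqueued command may still use the objects */
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);
clReleaseKernel(clkern);
clReleaseProgram(program);       /* 'program' = the cl_program clkern was built from */
clReleaseCommandQueue(queue);
clReleaseContext(context);
```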

The following code is equivalent to what I’m implementing; I can’t post the real version because it’s too complicated.
float input[N] = …;
float output[N];
cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(context, device_id, 0, &err); //in-order-execution command queue
cl_mem _clmem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);
cl_mem _clmem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);
clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, bytes(), input, 0, NULL, NULL); //non-blocking write
clSetKernelArg(clkern, 0, sizeof(cl_mem), &_clmem1);
clSetKernelArg(clkern, 1, sizeof(cl_mem), &_clmem2);
size_t workdim[1] = {N};
clEnqueueNDRangeKernel(queue, clkern, 1, NULL, workdim, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, bytes(), output, 0, NULL, &eventout); //non-blocking read
do {
    clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL);
} while (ret != CL_COMPLETE);
// print output
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);
//Not releasing anything else

When I change the non-blocking write to a blocking write, it works fine. Please help!

Did you read the OpenCL specification? It has a paragraph dedicated to you personally :)

Using clGetEventInfo to determine if a command identified by event has finished execution (i.e. CL_EVENT_COMMAND_EXECUTION_STATUS returns CL_COMPLETE) is not a synchronization point. There are no guarantees that the memory objects being modified by command associated with event will be visible to other enqueued commands.
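In other words, the polling loop tells you the command finished, but it does not give you the memory-visibility guarantee. You need a real synchronization point before touching `output`, for example (sketch, using the `eventout` from your snippet):

```c
/* Instead of spinning on clGetEventInfo, wait on the event itself.
 * clWaitForEvents is a synchronization point: when it returns, the
 * buffer contents are guaranteed visible to the host. */
clWaitForEvents(1, &eventout);
/* output[] is now safe to read */
```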

Lol, good point. I totally missed that part. What can I use then?

First, I can’t use a blocking read because I want the host thread to do something else while waiting. For example, if there are multiple GPU devices, while it waits for the buffer read it can enqueue more commands on the other devices.

The option I can think of is clSetEventCallback, but I’m using OpenCL 1.0, which doesn’t have that feature yet (it was added in OpenCL 1.1).
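For reference, if I could move to OpenCL 1.1, I believe the callback route would look roughly like this (untested sketch; `eventout` is the read event from above, and `done` is a hypothetical flag the host loop would watch):

```c
/* OpenCL 1.1+ sketch: be notified when the read completes instead of
 * polling. The callback may run on a driver thread, so keep it short
 * and only do thread-safe work inside it. */
void CL_CALLBACK on_read_complete(cl_event ev, cl_int status, void *user_data)
{
    if (status == CL_COMPLETE)
        *(volatile int *)user_data = 1;   /* signal the host loop */
}

/* after enqueueing the non-blocking read that produced eventout: */
volatile int done = 0;
clSetEventCallback(eventout, CL_COMPLETE, on_read_complete, (void *)&done);
```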

Also, what is clGetEventInfo for, then? What does it mean when it returns a complete status for a read-buffer command? It doesn’t make sense to me that the read-buffer command is complete but the buffer is not actually visible in the host’s memory.

Thanks in advance.

Well, what about using clGetEventInfo only as a hint that a subsequent clFinish will likely take little time?

OK, so I tried different combinations of blocking/non-blocking/clFinish for writing and reading the buffer in a very simple program that just copies input to output without any computation. From the results, the hypothesis that “clFinish after clGetEventInfo reports complete will fix the problem” is not quite right.

So, there are 4 crucial steps:

  1. clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, bytes(), input, 0, NULL, NULL); //non-blocking write
  2. clEnqueueNDRangeKernel(queue, clkern, 1, NULL, workdim, NULL, 0, NULL, NULL);
  3. clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, bytes(), output, 0, NULL, &eventout); //non-blocking read
  4. do { clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL); } while (ret != CL_COMPLETE);

I printed everything out and found that the write buffer didn’t do the right thing. I read back both _clmem1 and _clmem2 at the very end: _clmem1 and _clmem2 match each other, but _clmem1 is totally different from the input.

Then I followed Milikov’s suggestion of using clFinish after step 4. It doesn’t fix the problem; it still produces the wrong output. However, as I mentioned earlier, when I make step 1 a blocking write, or put clFinish after step 1 (without clFinish after step 4), it always produces the correct output.

It doesn’t make sense to me, because the command queue is in-order-execution, so the kernel should start only after the write buffer finishes, regardless of whether the write is blocking or non-blocking. Any thoughts on this?

Actually, this might be relevant: I run many of these programs concurrently. They are independent of each other, and each has its own context and command queue. Note that I have only one GPU device but more than one context. I think it’s okay to create multiple contexts, but maybe I’m wrong; someone who knows, please clarify.

One more observation on this behavior: when I run the program with one context and one command queue, everything is fine. However, when I run it with one context and multiple command queues, it sometimes generates wrong output.
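For context, the multi-queue version follows roughly this pattern (a sketch with hypothetical names `queues[]`, `in_bufs[]`, `out_bufs[]`, `kernels[]`; each job is independent):

```c
/* In-order execution is only guaranteed WITHIN one queue; commands in
 * different queues may overlap arbitrarily, even in the same context. */
for (int i = 0; i < num_queues; i++) {
    clEnqueueWriteBuffer(queues[i], in_bufs[i], CL_FALSE, 0, nbytes, inputs[i], 0, NULL, NULL);
    clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queues[i], out_bufs[i], CL_FALSE, 0, nbytes, outputs[i], 0, NULL, NULL);
}
for (int i = 0; i < num_queues; i++)
    clFinish(queues[i]);   /* synchronize each queue before using its outputs[i] */
```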