Weird behavior from non-blocking clEnqueueWriteBuffer

I’m using an in-order-execution command queue with non-blocking buffer reads and writes. According to the OpenCL documentation, this should be fine: as long as I enqueue the commands in the right order, the queue will execute them one after another in the order they were enqueued. However, I sometimes get wrong output, though most of the time the output is correct. When I switch to a blocking write with a non-blocking read, the output is always correct. Does anyone know why that is?

I’m not sure if it’s related, but in my code I never release the cl_program, cl_kernel, cl_command_queue, or cl_context, though I do release the cl_mem objects. If you’re wondering why: I’m writing a compiler that generates OpenCL code, so it’s not straightforward to figure out where the right place to release things is. I know releasing is crucial, and I will fix that as a first priority.
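(When I do get to it, my understanding is that teardown is essentially creation in reverse, something like the sketch below, where `program` is a hypothetical name for the cl_program behind `clkern`:)

```c
/* Sketch: release OpenCL objects in roughly the reverse order of
 * creation, after the queue has drained. */
clFinish(queue);                 /* no enqueued command may still use the objects */
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);
clReleaseKernel(clkern);
clReleaseProgram(program);       /* 'program' = the cl_program clkern was built from */
clReleaseCommandQueue(queue);
clReleaseContext(context);
```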

The following code is equivalent to what I’m implementing; I can’t post the real version because it’s too complicated.
float input[N] = …;
float output[N];
cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(context, device_id, 0, &err); //in-order-execution command queue
cl_mem _clmem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);
cl_mem _clmem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);
clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, bytes(), input, 0, NULL, NULL); //non-blocking write
clSetKernelArg(clkern, 0, sizeof(cl_mem), &_clmem1);
clSetKernelArg(clkern, 1, sizeof(cl_mem), &_clmem2);
size_t workdim[1] = {N};
clEnqueueNDRangeKernel(queue, clkern, 1, NULL, workdim, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, bytes(), output, 0, NULL, &eventout); //non-blocking read
do {
    clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL);
} while (ret != CL_COMPLETE);
// print output
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);
//Not releasing anything else

When I change the non-blocking write to a blocking write, it works fine. Please help!

Did you read the OpenCL specification? It has a paragraph dedicated to you personally :)

Using clGetEventInfo to determine if a command identified by event has finished execution (i.e. CL_EVENT_COMMAND_EXECUTION_STATUS returns CL_COMPLETE) is not a synchronization point. There are no guarantees that the memory objects being modified by command associated with event will be visible to other enqueued commands.
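In other words, the polling loop tells you the command finished, but it does not give you the memory-visibility guarantee. You need a real synchronization point before touching `output`, for example (sketch, using the `eventout` from your snippet):

```c
/* Instead of spinning on clGetEventInfo, wait on the event itself.
 * clWaitForEvents is a synchronization point: when it returns, the
 * buffer contents are guaranteed visible to the host. */
clWaitForEvents(1, &eventout);
/* output[] is now safe to read */
```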

Lol, good point. I totally missed that part. What can I use then?

First, I can’t use a blocking read because I want the host thread to do something else while waiting. For example, if there are multiple GPU devices, while it waits for the buffer read it can enqueue more commands on the other devices.

The option I can think of is clSetEventCallback, but I’m using OpenCL 1.0, which doesn’t have that feature yet (it was added in OpenCL 1.1).
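For reference, if I could move to OpenCL 1.1, I believe the callback route would look roughly like this (untested sketch; `eventout` is the read event from above, and `done` is a hypothetical flag the host loop would watch):

```c
/* OpenCL 1.1+ sketch: be notified when the read completes instead of
 * polling. The callback may run on a driver thread, so keep it short
 * and only do thread-safe work inside it. */
void CL_CALLBACK on_read_complete(cl_event ev, cl_int status, void *user_data)
{
    if (status == CL_COMPLETE)
        *(volatile int *)user_data = 1;   /* signal the host loop */
}

/* after enqueueing the non-blocking read that produced eventout: */
volatile int done = 0;
clSetEventCallback(eventout, CL_COMPLETE, on_read_complete, (void *)&done);
```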

Also, what is clGetEventInfo for, then? What does it mean when it returns a complete status for a read-buffer command? It doesn’t make sense to me that the read-buffer command is complete but the buffer is not actually visible in the host’s memory.

Thanks in advance.

Well, what about using clGetEventInfo only as a hint that a subsequent clFinish will likely take little time?

OK, so I tried different combinations of blocking/non-blocking/clFinish for writing and reading the buffer in a very simple program that just copies input to output without any computation. From the results, the hypothesis that “clFinish after clGetEventInfo reports complete will fix the problem” is not quite right.

So, there are 4 crucial steps:

  1. clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, bytes(), input, 0, NULL, NULL); //non-blocking write
  2. clEnqueueNDRangeKernel(queue, clkern, 1, NULL, workdim, NULL, 0, NULL, NULL);
  3. clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, bytes(), output, 0, NULL, &eventout); //non-blocking read
  4. do { clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL); } while (ret != CL_COMPLETE);

I printed everything out and found that the write buffer didn’t do the right thing. I read back both _clmem1 and _clmem2 at the very end: _clmem1 and _clmem2 match each other, but _clmem1 is totally different from the input.

Then I followed Milikov’s suggestion of using clFinish after step 4. It doesn’t fix the problem; it still produces the wrong output. However, as I mentioned earlier, when I make step 1 a blocking write, or put clFinish after step 1 (without clFinish after step 4), it always produces the correct output.

It doesn’t make sense to me, because the command queue is in-order-execution, so the kernel should start only after the write buffer finishes, regardless of whether the write is blocking or non-blocking. Any thoughts on this?

Actually, this might be relevant: I run many of these programs concurrently. They are independent of each other, and each has its own context and command queue. Note that I have only one GPU device but more than one context. I think it’s okay to create multiple contexts, but maybe I’m wrong; someone who knows, please clarify.

One more observation on this behavior: when I run the program with one context and one command queue, everything is fine. However, when I run it with one context and multiple command queues, it sometimes generates wrong output.
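For context, the multi-queue version follows roughly this pattern (a sketch with hypothetical names `queues[]`, `in_bufs[]`, `out_bufs[]`, `kernels[]`; each job is independent):

```c
/* In-order execution is only guaranteed WITHIN one queue; commands in
 * different queues may overlap arbitrarily, even in the same context. */
for (int i = 0; i < num_queues; i++) {
    clEnqueueWriteBuffer(queues[i], in_bufs[i], CL_FALSE, 0, nbytes, inputs[i], 0, NULL, NULL);
    clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queues[i], out_bufs[i], CL_FALSE, 0, nbytes, outputs[i], 0, NULL, NULL);
}
for (int i = 0; i < num_queues; i++)
    clFinish(queues[i]);   /* synchronize each queue before using its outputs[i] */
```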