clFinish / Blocking Read sync iterative kernel calls

noxnet · March 4, 2010, 1:48pm

I’m working on an iterative algorithm and as there is no possibility

for global sync within a kernel i’m iterating over clEnqueueNDRangeKernel(…)

Kernel calls:

int first_input = 1;

for(x = 0; x < iterations; x++)

{

	err = clEnqueueNDRangeKernel(cmd_queue, kernels[kernel_id], 2, NULL, 

		global_work_size, local_work_size, 0, NULL, &kernel_event);	

		

	clFinish(cmd_queue);

	first_input = (first_input == 1) ? 0 : 1;

	err |= clSetKernelArg(kernels[kernel_id], 7, sizeof(int), &first_input);

}

...

if(iterations%2)

	err = clEnqueueReadBuffer(cmd_queue, mem1, CL_TRUE, 0, buff_size_padded, result, 0, NULL, NULL);

else

	err = clEnqueueReadBuffer(cmd_queue, mem2, CL_TRUE, 0, buff_size_padded, result, 0, NULL, NULL);

...

first_input variable is used to switch matrices used in kernel

e.g.

iter 1: mat1 = input, mat2 = output

iter 2: mat1 = output, mat2 = input

iter 3: mat1 = input, mat2 = output

…

After some testing i found out when removing clFinish from code above, commands in command queue seem to

be executed when reading back data (no matter if i uses blocking read or non-blocking).

This has no significant effect to total execution time and results always the same (as far as I’ve tested).

As execution is in-order as default, I assume that calling clEnqueueReadBuffer has the same effect than clFinish (+ reading data).

But is it garanteed that kernel-executions are executed with correct kernel argument without clFinish?

e.g.

iter 1: mat1 = input, mat2 = output, first_input = 1;

iter 2: mat1 = output, mat2 = input, first_input = 0;

iter 3: mat1 = input, mat2 = output, first_input = 1;

jcpalmer · March 4, 2010, 4:40pm

I think any time a kernel is queued the args at the time should be used without requiring a finish. It is a queue after all.

I think some blocking should be employed though. No blocking may always work on a given platform, but for portability / changes in future platform releases, I would not rely on this myself. One thing you might do is replace the finish with a flush, and then after the loop do a finish or blocking read.

If you have a good idea of the minimum time it takes to execute the entire set & you wish to possibly free up a CPU core from spin waiting, then you might also put in a short thread sleep after the loop & before the blocking call. Only do this if you know the CPU time could be being used elsewhere, like maybe more GPU’s on different threads or other stuff running on the system. This also may have not effect on some platforms. It depends on how a platform detects when things are done.

Hellgruen · March 7, 2010, 9:06am

This happens when your memory you read into is not pinned memory. You need pinned memory for asynchronous transfers.

noxnet · March 7, 2010, 9:32am

I don’t need asynchronous transfers because i read back data only once after all iterations are done.

But pinned memory sounds interessting. I read little about it, but can’t figure out details. I also opened a thread about it last week, but without response until now.

Hope that you can give me some information.

I read that using calling clCreateBuffer with flag CL_MEM_ALLOC_HOST_PTR set it is likey but not assured that pinned memory is used. So how do i verify that it is used? Is there any possibilty?

What size limitations are there? Is it big enough to put 2 Matrices 1024x1024 (float) into?

What benefits can i expect?

In which cases should it be used?

Hellgruen · March 8, 2010, 7:22pm

You cant verify if you got pinned memory or not. But chances are very high that you get pinned memory for reasonable sizes.

You can allocate quite a lot of pinned memory where the size is mostly limited by the physical memory in the machine the application is run on. As pinned memory cant be paged out allocating too much may lead to excessive swap activity, degrading overall system performance. Using a couple of megabytes wont hurt though but i would not go over a quarter of physical memory.

Benefits are that you offload the copying process to the GPU instead of letting the CPU do the work instantly when doing the memory operation. This assumes that you work with the mapped pointer of the Buffer Object you allocated pinned and copy that over to a device memory buffer. Working with normal memory, copying that to pinned memory and that to device memory is mostly slower than using pageable memory.

Its hard to say when to use pinned memory. Id use it for host → device and device → host transfers only and not application global. If your application is only doing things in OpenCL i would only use pinned memory for the Host side.