Kernel Functions Signifying Termination Of Numeric Optimization Problem


Suppose I need to solve \argmin_x f(x) (x is a vector), and, in order to do so, I run a loop of the form

x = initial_x
do
    prev_x = x
    x = \argmin_x g(prev_x)
while x significantly different from prev_x

Furthermore, suppose g() is a separable mathematical function, so the line x = \argmin… can be implemented efficiently via a set of kernel calls: the first optimizing x_1, the second x_2, and so on. How can I know when to terminate the loop? It depends on whether at least one kernel invocation i encountered a significant change in its x_i. There seem to be two possibilities:

  1. Keep moving x to and from device memory and check termination on the host: works, but seems expensive
  2. Keep x in device memory: seems efficient, but how do I pass back this information to the host?

Could you please advise?

Thanks & Bye,


If both x and prev_x are in device memory, just compute the “difference” between them there. If you’re using something like an L1 or L2 norm, this boils down to a transformed reduction. Then you only need to transfer one number back to the host and decide whether it is small enough; if it is, you can break out of the loop. Unless your function g is extremely fast, the latency of the transfer will not be a bottleneck.

If the latency is a concern, you can also set a flag on the device that indicates whether more iterations are necessary. If you check that flag at the beginning of each kernel, you can queue a few iterations in advance without needing to transfer the flag to the host each time.

Eelsen and Tera,

Many thanks for your replies!

I’m having some difficulty understanding the most efficient way to perform the transformed reduction. The metric used is actually the L-infinity norm, i.e., the maximum over all indices of the absolute difference between the two vectors. So it goes something like this:

// figure out that I’m dealing with index i

// Do some optimization stuff

// See if, as far as I’m concerned, there’s a need to continue the outer iteration.
// prev_x and x are in device memory.
if(::fabs(prev_x[i] - x[i]) > eps)
(*) // What goes here?

invoke kernel_function_g(…)
while (**)

The problem is how to connect the line with (*) to the line with (**):

  1. If (*) sets some boolean (in device memory) to true, and (**) accesses this boolean (in host memory, after transfer), then the transfer is efficient, but isn’t there a problem with all kernels accessing the same boolean?
  2. If (*) accesses some “done” array at index i (in device memory), and (**) iterates over the “done” array (in host memory, after transfer), then the kernels are efficient, but isn’t the transfer expensive?

Thanks & Bye,


One way to do the L-infinity norm test is to have a convergence flag for each block in the kernel launch.

Each thread can check local convergence and set a flag in a shared memory array indicating whether it has converged. At the end of the kernel, do an in-shared-memory parallel reduction, which leaves you with one flag per block indicating whether the whole block has converged. Write that convergence flag back to global memory at the end of the kernel. You can then read the status array back to the host periodically and test it to see whether you should launch more kernels. A bare-bones implementation in the kernel might look something like this:

__global__ void kernel_g(......., int *converged)
{
    __shared__ volatile int local_convergence[NTHREADS];
    __shared__ volatile int block_convergence;

    if (threadIdx.x == 0) { block_convergence = converged[blockIdx.x]; }
    __syncthreads();

    // If this block converged in an earlier launch, do nothing.
    if (block_convergence) return;

    // per thread computation goes here

    // convergence test (x_new/x_old stand for this thread's values)
    local_convergence[threadIdx.x] = (fabs(x_new - x_old) < tol);
    __syncthreads();

    // Parallel reduction (logical AND) of local_convergence,
    // assuming NTHREADS is a power of two
    for (int s = NTHREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            local_convergence[threadIdx.x] =
                local_convergence[threadIdx.x] && local_convergence[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) { converged[blockIdx.x] = local_convergence[0]; }
}


The effect is that once a block has converged, it won’t do any more computation, just exit. Once all the blocks have converged, the whole kernel launch will do nothing. Your host code can copy back the convergence array and check it at suitable intervals, launching more kernels until everything has converged. The per-block convergence array won’t be that large or take much time to copy back to the host, and you can overlap the check with kernel computation, making it effectively free of additional cost.

Sorry, just saw your reply. Many thanks! This seems like a very good solution.

Thanks & Bye,