Can I reuse threads in one direction of a grid several times within a kernel in a serial way?

I often launch kernels with two- or three-dimensional grids where only part of the calculations in the kernel depends on all of the available dimensions, as in this example kernel:

__global__ void AllInOneKernel(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		// Computations in x and y direction.
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			atomicAdd(x + v, SomeVariable[row]);
		}
		// Computations in only x direction.
		x[v] = sqrt(x[v]);
		// Computations in x and y direction again.
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			SomeOtherVariable[row + n*v] *= x[v];
		}
	}
}

First I need the full two-dimensional grid, then I want to temporarily put the threads in the y-direction on hold while I continue in the x-direction only, and finally I need the two-dimensional grid again.
When I launch the kernel this way, the results I get are incorrect. If I instead partition the calculations into three separate kernels like this:

__global__ void Kernel1(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		// Computations in x and y direction.
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			atomicAdd(x + v, SomeVariable[row]);
		}
	}
}
__global__ void Kernel2(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		// Computations in only x direction.
		x[v] = sqrt(x[v]);
	}
}
__global__ void Kernel3(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		// Computations in x and y direction again.
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			SomeOtherVariable[row + n*v] *= x[v];
		}
	}
}

With that partitioning the results are correct. But since every kernel launch carries some overhead, I’m guessing it’s not a great idea to divide my calculations across a myriad of kernels like this.

So what I wonder is: how can I change the first kernel to produce correct results without the need for three separate kernels? I’ve tried using __syncthreads() between the calculations, but it doesn’t help. How is this problem usually solved?

In your first code example, when you “switch” from a 2D grid to a 1D grid, it’s not clear whether you are actually conditioning the code so that only certain threads are permitted to act - e.g. a single “row” of threads. If not, then every thread in the y-direction reads x[v] and rewrites it with sqrt(x[v]), and that may be one source of error.
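For example, the x-only line in your first kernel could be guarded like this (a fragment meant to replace the x[v] = sqrt(x[v]); line; on its own it does not provide the ordering guarantee discussed below):

		// Computations in only x direction: let a single "row" of threads act,
		// so x[v] is read and written exactly once instead of once per y-thread.
		if (threadIdx.y == 0 && blockIdx.y == 0) {
			x[v] = sqrt(x[v]);
		}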

But it also seems that what you’re up against is the need for a grid-wide sync operation. The first operation appears to be doing atomicAdd on the entire array pointed to by x. Then, after all that is done, you want to take the square root of each location. Presumably you want to wait for all the atomics to finish before taking the square root.

As you have formulated it (assuming all of that is true), you would need a grid-wide sync, and the kernel launch boundary is what gives you that. __syncthreads() only synchronizes the threads within a single block, which is why it doesn’t help here. If you want to perform those operations in the same kernel, without other refactoring, you would need to investigate cooperative groups - specifically a cooperative grid launch.
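As an illustration only, not a drop-in solution (it requires a device that supports cooperative launch, and the whole grid must be small enough to be resident on the device at once), a sketch of your first kernel with grid-wide syncs might look like this; the kernel name and the host-side variable names are mine:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void AllInOneKernelCoop(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	cg::grid_group grid = cg::this_grid();
	// Computations in x and y direction.
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			atomicAdd(x + v, SomeVariable[row]);
		}
	}
	grid.sync();  // wait until every atomicAdd in the entire grid has completed
	// Computations in only x direction: restricted to one "row" of threads.
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		if (threadIdx.y == 0 && blockIdx.y == 0) {
			x[v] = sqrt(x[v]);
		}
	}
	grid.sync();  // make the square roots visible to all threads in the grid
	// Computations in x and y direction again.
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
			SomeOtherVariable[row + n*v] *= x[v];
		}
	}
}

A kernel that uses grid.sync() must be launched as a cooperative kernel, e.g. (d_x, d_SomeVariable, d_SomeOtherVariable, grid_dims and block_dims are hypothetical host-side names):

	void *args[] = { &d_x, &d_SomeVariable, &d_SomeOtherVariable, &k, &n };
	cudaLaunchCooperativeKernel((void*)AllInOneKernelCoop, grid_dims, block_dims, args);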

However, I don’t know the overall scope of your problem, and I wouldn’t recommend major changes without some effort spent profiling to convince yourself you know where the actual bottlenecks are.

There are other refactoring possibilities as well. For example, the type of operation you are performing in the first step might be a kind of “scatter” operation: you start with a data set (perhaps a location in a matrix) and, based on that value, you update some other value in the output matrix. Sometimes “scatter” operations like this can be converted to an equivalent “gather” operation: you start with a location in the output matrix and collect all the inputs that affect that output. Such a scatter-to-gather transformation can sometimes allow you to eliminate the use of atomics and “regularize” your code. If such a transformation were possible in your case, then tacking a square-root operation onto the end of the first step would (I think) be trivial to do without another kernel launch.
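Purely as an illustration of the idea (whether it applies depends on what your real computation actually does): in the code you posted, every x[v] receives the same row sum, so a gather-style version launched with a 1D grid could let each x-thread accumulate that sum privately, fuse the square root into the same step, and then do the scaling, with no atomics and no cross-thread synchronization. The kernel name is mine:

__global__ void GatherKernel(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
	for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
		// Gather: this thread privately accumulates the full row sum (serial over n).
		double sum = 0.0;
		for (int row = 0; row < n; ++row) {
			sum += SomeVariable[row];
		}
		// Square root fused into the same step; no other thread touches x[v].
		x[v] = sqrt(x[v] + sum);
		// Scaling step: the range row + n*v is owned exclusively by this thread.
		for (int row = 0; row < n; ++row) {
			SomeOtherVariable[row + n*v] *= x[v];
		}
	}
}

The trade-off is that the y-dimension is now traversed serially by each thread, so whether this is a win depends on the relative sizes of k and n.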

I’m sure there are other refactoring possibilities as well.