Often I launch kernels with two- or three-dimensional grids where only part of the computation in the kernel depends on all the available dimensions, as in this example kernel:

```
__global__ void AllInOneKernel(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        // Computations in the x and y directions.
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            atomicAdd(x + v, SomeVariable[row]);
        }
        // Computations in the x direction only.
        x[v] = sqrt(x[v]);
        // Computations in the x and y directions again.
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            SomeOtherVariable[row + n*v] *= x[v];
        }
    }
}
```
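For context, this is roughly how I launch it. The block and grid dimensions here are just illustrative placeholders, not my actual configuration:

```
// Illustrative launch configuration (placeholder sizes).
dim3 block(32, 8);  // 256 threads per block
dim3 grid((k + block.x - 1) / block.x,
          (n + block.y - 1) / block.y);
AllInOneKernel<<<grid, block>>>(d_x, d_SomeVariable, d_SomeOtherVariable, k, n);
cudaDeviceSynchronize();
```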

First I need the two-dimensional grid, then I need to temporarily put the threads in the y-direction on hold while the threads in the x-direction continue, and then finally use the two-dimensional grid again.

If I launch the kernel this way, the results I get are incorrect. If I instead partition the calculations into three separate kernels, like this:

```
__global__ void Kernel1(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        // Computations in the x and y directions.
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            atomicAdd(x + v, SomeVariable[row]);
        }
    }
}

__global__ void Kernel2(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        // Computations in the x direction only.
        x[v] = sqrt(x[v]);
    }
}

__global__ void Kernel3(double *x, double *SomeVariable, double *SomeOtherVariable, const size_t k, const size_t n) {
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        // Computations in the x and y directions again.
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            SomeOtherVariable[row + n*v] *= x[v];
        }
    }
}
```

the results are correct. But since every kernel launch carries some overhead, I'm guessing it's not a great idea to spread my calculations across a myriad of small kernels like this.
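For what it's worth, I launch the three kernels back-to-back on the default stream; as I understand it, launches issued to the same stream execute in order, so each kernel only starts after the previous one has finished, which is what makes this version correct:

```
// Successive launches on the same (default) stream run in issue order,
// so each kernel sees the completed results of the previous one.
// grid2d/block2d and grid1d/block1d are placeholder launch configurations.
Kernel1<<<grid2d, block2d>>>(d_x, d_SomeVariable, d_SomeOtherVariable, k, n);
Kernel2<<<grid1d, block1d>>>(d_x, d_SomeVariable, d_SomeOtherVariable, k, n);
Kernel3<<<grid2d, block2d>>>(d_x, d_SomeVariable, d_SomeOtherVariable, k, n);
```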

So my question is: how can I change the first kernel to produce correct results, without splitting it into three kernels? I've tried placing `__syncthreads()` between the calculation steps, but it doesn't help. How is this problem usually solved?
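One thing I've read about but haven't tried successfully is cooperative groups: apparently `grid.sync()` can act as a grid-wide barrier (whereas `__syncthreads()` only synchronizes threads within a single block), provided the kernel is launched with `cudaLaunchCooperativeKernel` and the whole grid is resident on the device at once. A sketch of what I imagine this would look like, assuming cooperative launch is supported (the phases are split so that every thread reaches each `grid.sync()`, which I gather is required to avoid deadlock):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void AllInOneKernel(double *x, double *SomeVariable,
                               double *SomeOtherVariable,
                               const size_t k, const size_t n) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: accumulate in the x and y directions.
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            atomicAdd(x + v, SomeVariable[row]);
        }
    }
    grid.sync();  // all atomic additions complete before anyone reads x

    // Phase 2: only one thread per v takes the square root,
    // to avoid several y-threads applying sqrt to the same element.
    if (blockIdx.y == 0 && threadIdx.y == 0) {
        for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
            x[v] = sqrt(x[v]);
        }
    }
    grid.sync();  // updated x[v] is visible to all threads before phase 3

    // Phase 3: scale in the x and y directions again.
    for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < k; v += blockDim.x * gridDim.x) {
        for (int row = blockIdx.y * blockDim.y + threadIdx.y; row < n; row += blockDim.y * gridDim.y) {
            SomeOtherVariable[row + n*v] *= x[v];
        }
    }
}
```

Is this the usual approach, or is splitting into multiple kernels actually the recommended solution despite the launch overhead?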