Hi.

I’ve been working on numerical analysis, in particular solving massive sparse linear systems.

I’m currently optimizing BiCGSTAB, an iterative algorithm for solving such systems.

My problems require several hundred BiCGSTAB iterations, and each iteration contains two matrix-vector multiplications.

```
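void BiCGSTAB(...) {
    do {
        matrix_vector_mul <<< ... >>> (...);   // first matrix-vector product
        dot_product <<< ... >>> (...);
        matrix_vector_mul <<< ... >>> (...);   // second matrix-vector product
        vector_arithmetic <<< ... >>> (...);
        ...
    } while (!converged);   // repeat until the solver has converged
}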
```

The code is memory bound: about 70% of the execution time is spent fetching from global memory.

My thinking was that the matrix does not change during a single BiCGSTAB solve, so if it could reside in much faster memory instead of global memory, a significant speedup should be achievable.

As far as I can tell, there are two ways to implement this.

The first is to merge all kernels into a single one, like the following.

```
__global__ void BiCGSTAB_KERNEL(...) {
    // ...load matrix data into registers
    do {
        matrix_vector_multiplication(...);
        // *global sync needed*
        dot_product(...);
        // *global sync needed*
        matrix_vector_multiplication(...);
        // *global sync needed*
        vector_arithmetic(...);
        ...
    } while (!converged);   // repeat until the solver has converged
}
```

The algorithm uses several kernels: matrix-vector multiplication, dot product, and vector arithmetic routines. They are split into separate launches because all threads must be synchronized globally between these steps. If I could merge them into one kernel, the matrix data could stay in registers and the memory bottleneck would be eliminated. However, merging the kernels requires inter-block synchronization inside a kernel, which, as far as I know, CUDA does not support on Pascal GPUs. I tried some alternatives, but they all failed with a large number of threads (> 2^15).
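For context, one common hand-rolled approach to inter-block synchronization is an atomic-counter barrier. The sketch below only illustrates that general technique under stated assumptions; it is not necessarily what was tried above. The names `grid_barrier`, `fused`, and `run_fused` are hypothetical, and the toy kernel merely stands in for the real solver steps. Such a barrier can only complete if every block of the grid is resident on the GPU at the same time, so the host code caps the grid size using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. A larger grid can hang, which may be related to failures at large thread counts.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical hand-rolled grid barrier (illustration only).
// `counter` starts at 0; `target` is the number of blocks expected to arrive.
__device__ void grid_barrier(unsigned int *counter, unsigned int target) {
    __syncthreads();                    // every thread of this block is done
    if (threadIdx.x == 0) {
        __threadfence();                // make this block's writes visible
        atomicAdd(counter, 1u);         // announce arrival
        // Spin until all blocks have arrived. If some blocks are not
        // resident, they never arrive and this loop never ends.
        while (atomicAdd(counter, 0u) < target) { }
        __threadfence();
    }
    __syncthreads();                    // release the rest of the block
}

// Toy stand-in for a fused solver step:
// phase 1 writes, barrier, phase 2 reads data written by another block.
__global__ void fused(int *a, int *b, int n, unsigned int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = i;
    grid_barrier(counter, gridDim.x);
    if (i < n) b[i] = a[(i + blockDim.x) % n];
}

// Runs the toy kernel and returns true if phase 2 saw all phase-1 writes.
bool run_fused(int blocks, int threads) {
    int n = blocks * threads;
    int *a = nullptr, *b = nullptr;
    unsigned int *counter = nullptr;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));
    fused<<<blocks, threads>>>(a, b, n, counter);
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a); cudaFree(b); cudaFree(counter);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i)
        if (h[i] != (i + threads) % n) return false;
    return true;
}

int main() {
    const int threads = 256;
    int sms = 0, per_sm = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, fused, threads, 0);
    int max_resident = sms * per_sm;    // upper bound on co-resident blocks
    printf("max co-resident blocks: %d\n", max_resident);
    // Stay at or below the bound; a larger grid risks a hang.
    printf("barrier test: %s\n",
           run_fused(max_resident, threads) ? "ok" : "failed");
    return 0;
}
```

Note that a normal kernel launch does not guarantee co-residency even when the grid is within this bound, so the sketch relies on behavior that usually holds in practice rather than on a documented guarantee.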

The second is to control the GPU context directly, but since I’m not a computer science professional, I could not find a way to do that, or even determine whether it is possible.

So my questions are:

- Is there any way to implement global synchronization, i.e. to synchronize all threads across all blocks within a single kernel?
- Is it possible to preserve or save a context that contains the register contents?

Thanks.