Is there no way to let global memory reside in local memory after a kernel ends?


I’ve been working on numerical analysis such as solving massive sparse linear system.
Now I’m doing optimizations on BiCGSTAB, a numerical algorithm to solve such systems.
My problems require several hundreds of BiCGSTAB iterations, and each iteration contains two matrix-vector multiplication.

void BiCGSTAB(...) {

do {

matrix_vector_mul <<< ... >>> (...);

dot_product <<<...>>> (...);

matrix_vector_mul <<< ... >>> (...);

vector_arithmetic <<<...>>> (...);


} while (converged);

The biggest bottleneck in this situation is memory bound; about 70% of execution time is taken by global memory fetching.
My opinion was, the matrix used in a single BiCGSTAB does not change, so if the matrix memory can be resided in much faster memory, not in global memory, significant speedup would be achieved.

In my conclusion, there are two ways to implement this.

The first one is merging all kernels into a single one like the followings.

__global__ void BiCGSTAB_KERNEL(...) {

//...fetch memory to register

do {


// <b>*global sync needed</b>


// <b>*global sync needed</b>


// <b>*global sync needed</b>



} while(converged);


There are several kernels used in the algorithm such as matrix-vector multiplication, dot product and vector arithmetic routines. Since global thread synchronization should be assured among those procedures, they are all split. If I could merge all the kernels, the matrix data would be stored in register memory, then memory bottleneck would be eliminated. However, merging kernels requires developing inter-block synchronization in GPU kernels, which CUDA does not support in Pascal GPU. I tried some alternatives, but all failed in large number( > 2^15) of threads.

The second one is to control context in GPU, but since I’m not a computer science professional, I cannot find any way to do that and even possibility to do that.

So my questions are

  1. Is there any way to implement global synchronization or synchronize all threads in a global kernel?
  2. Is it possible to preserve or save a context which contain register memory?


  1. cooperative groups
  2. the only way to do it is to have a single kernel or persistent kernel

Hello Robert. Thanks for the answer.

I know that the cooperative group feature enables synchronization in a whole grid. But what I have in question is whether there are limitations on resources to use the feature. It’s because in ‘conjugate gradient’ sample in cuda advanced sample, the code does not load global memory to register but uses the global memory as it is. Since my code deals with a pretty massive system (rank of matrices is 0.1 million ~ 1 million) and what I want to do is to load all the matrix data in register, if there’s such restrictions on resources, it would not be an attractive solution.

And for the second answer, I have never heard the term “persistent kernel”. Googling shows me another term “persistent thread” instead of the former. Can you explain it or provide materials about concept of it?

Thanks in advanve.

There are limitations required to use cooperative groups (at least, cooperative groups for grid-wide sync, anyway). I’m not sure I would call these resource limitations, because a carefully designed cooperative kernel launch can still take advantage of all GPU “resources”, eg. registers, SM thread capacity, shared memory, etc.

If you google “cuda persistent kernel” you will certainly find writeups on the topic, including those that I have written e.g. here and on StackOverflow. Persistent threads and persistent kernels are the same topic, at a high level.

Dear Robert.

Thanks for the advise. I tried a test, in which I used only a single block with 1024 threads per block, and used __syncthreads() to globally synchronize in a single block. While a thread takes a single row of a large matrix in split kernel version BiCGSTAB, a thread is in charge of a multiple rows in the test version solver. Then in small problems whose ranks are about 10 thousands, it shows significant speedup compared to the split version. (Time measurement improved from 20 seconds to 3 seconds.) So I anticipate that if I can use more blocks and utilize inter-block synchronization in the cooperative group feature, then it will fully accelerate a larger problem of which rank is 1 million.

Thanks again for your answer.