Performance of persistent thread approach on new gpu architectures

Hi all,

The persistent threads approach is published 2012[1], as you may know. My question is about its performance. Modern GPUs (kepler and more) involves quite efficient SM that is able to schedule cuda threads very well. I’m wondering that how would be efficient to use persistent thread approach onto kepler+ architectures ? If it shows slowdown for modern GPUs, why exactly?

Thank you in advance for useful answers.

[1]K. Gupta, J. A. Stuart, and J. D. Owens, “A study of persistent threads style gpu programming for gpgpu workloads,” in Innovative Parallel Computing (InPar), 2012, pp. 1–14, IEEE, 2012.

persistent threads approach is mostly orthogonal to the underlying hardware. It solves problems, and attempts to provide programming methodologies, that are extending any current GPU architecture.

Those methodologies will not go away anytime soon.

thank you txbob. You are right, PT (persistence thread) is definitely useful approach to re-use on chip cache and register without cpu intervention.

Actually, I’d like to extend my question then. Let’s say I’m not re-using on chip cache. I only want to use globalSync, in order to use that, I should use smaller thread hierarchy for PT. My application consists two work pieces (work1 -> globalSync -> work2). These two work pieces are using same data from global memory.

  • Option 1-) By using PT, I can run all my works in one kernel.
  • Option 2-) Without PT, I can do kernel1 (work1) -> kernel2(work2). I'd expect that cuda compiler will arrange registers number better for different kernels. Also, I create bigger kernel size as I'm not restricted with persistence thread
  • My question is which option better in this kind of circumstances ?

    Shortly, Persistence thread block limits me to create number of CTA because CTA number must be equal to physical SM. However if I create more CTA, I expect hardware scheduler works better than PT approach.

    I personally would not use persistent threads unless there was a specific reason to do so (and generally, “performance” is not specific enough).

    You’ve already listed some specific reasons:

    1. Necessity of doing in-kernel global sync
    2. Desire to avoid kernel launch latency
    3. As a way to implement a producer-consumer model
    4. Optimization of register/on-chip storage re-use (e.g. in the persistent nn example)
      (and I’m sure there are other possibilities)

    Persistent threads are one possible way to address each of the above concepts, but not the only way. Furthermore, PT cause (force) the programmer to walk a fine line of having enough work (e.g. threads) available for the GPU to hide latency, while not too many threads so as to exceed the limit imposed by PT.

    For example, if my algorithm is amenable to global sync using the kernel launch and ordinary structure of GPU work, I would certainly start with that, before diving into a PT model.

    It’s not possible to answer overly general questions as to “which is better”.

    In the final analysis, the burden is on you, the programmer, to prove which is better, for your specific case.

    It sounds neat but I doubt it’d be better than just flat out restructuring your existing algorithm to avoid things like the need for global sync points.

    Thank you very much for your answers. The reason of using PT is to deal with specific issue what I have already needed. Of course, It’s not the only solution to manage that such problems, I can change algorithm definitely, but I would like to understand difference between PT and nonPT. That’s why your comments are pretty important and useful for me to understand difference.

    The fact that PT imposed kernel hierarchy with physical SM and threads. In order clarify in my mind, I’d like talk with example. Let’s say, I am using Titan-X has 24 SM and each SM has 128 cuda cores.
    [nonPT solution] My current application

    1. work1 <>>
    2. cpu_implicit_sync
    3. work2 <>>

    [PT solution], I changed kernel hiarhacy and I merged two works into kernel workMEGA

  • workMEGA <>>
  • My last questions are:

    1. Is thread number (128) right for workMEGA as SMM can host only 128 thread ? Can I use 1024 ?
    2. Will I loose performance somehow because I'm not taking advantage of hardware scheduler ?
    3. For nonPT example, compiler is able optimize registers very well. But PT solution will have large kernel, so I expect number of register will be higher. How does it effect ?

    Thank you very much for your comments.

    1. You can use 1024. You can also do an occupancy analysis to possibly raise the block count above 24. For example, if your occupancy analysis indicates you can launch 2048 threads per SM (the hardware limit) it might be possible to run with 48 blocks of 1024 threads. The occupancy analysis could also suggest that you could run with 72 blocks of 512 threads, as another possible example (1536 threads per SM)

    2. You’re still using the hardware scheduler. Even with only 128 threads per block = 4 warps per block/sm, not all of those warps will have instructions selected on every cycle. But you can improve the scenario and the hardware scheduler’s ability to hide latency by giving it the maximum complement of warps to choose from, which might be 32 or more (see 1 above).

    3. It’s impossible to say without knowing the actual magnitudes. I would compile the codes with -Xptxas -v to have the compiler spit out register usage and spilling. If there is no spilling in either case, then the register usage should not be an issue. If there is spilling in one case or the other, you will have to try to judge the effect of that. Register usage is also one of the factors for occupancy analysis I mention in 1 above.

    Thank you very much txbob.
    The very last question then :) What kind of side effects can decrease my performance of PT solution?

    Btw, yet another question :) when I use simple gpu_sync for inter-block synchronization as I show, does it work in case of collision in GPU ?

    //the global mutex for global barrier function
    volatile __device__ int g_mutex=0;
     __device__ void __gpu_sync_bad(int blocks_to_synch)
        //thread ID in a block
        int tid_in_block= threadIdx.x;
        // only thread 0 is used for synchronization
        if (tid_in_block == 0)
            atomicAdd((int *)&g_mutex, 1);
            //only when all blocks add 1 to g_mutex will
            //g_mutex equal to blocks_to_synch
            while(g_mutex < blocks_to_synch);

    Are there any open-source frameworks available implementing persistent thread approaches, especially for RNNs ?
    At no source code is provided. There seem to be related general-purpose frameworks like ‘WhippleTree’ ( , source code at ), but I am not sure whether it is efficient for implementing persistent RNNs.

    Paper and Code for persistent rnns was just released by the Baidu guys, nice.

    Hi Robert, What’s the differences between the kernel fusion and persistent thread?

    Also, pursuant to previous discussions in this thread, the modern design technique for a persistent kernel is to use cuda cooperative groups in a cooperative kernel launch, instead of the calculations and considerations as they are laid out in comments 7 and 8 above. However the end point is similar (a limit to the number of threads and blocks).

    Hi Robet, Do you have an example of cooperative groups? I cannot classify the differences between the persistent kernel and cooperative groups.

    There are cooperative groups sample codes in the CUDA sample codes. Any sample code with CG in is using cooperative groups. The conjugateGradientMultiBlockCG sample code demonstrates grid-wide sync.

    Hi Robert, between these two ideas, kernel fusion and persistent thread, which method requires more GPU memory?

    I don’t know the answer to that. I’m sure it depends on the actual implementation.

    Hi, Robert. I think I found the answer from this paper, though it is not implemented on GPU. It depends on the implementation. It is a trade-off between memory cost and data transfer.