How high priority stream preemption

Hello, I have some question about stream priority, can i get answers about these questions below?

How to decide kernel’s priority in preemption?
I found some CUDA preemption example codes, people used CUDA stream priority to test preemption. Only the CUDA stream priority decides the kernel’s priority? or in CUDA runtime, are there other mechanisms to decide which is the higher priority kernel in preemption?
In my practise, high priority stream kernels may not preempt low priority stream kernels immediately. Does high priority preenpt has some other conditions, such as sched slice or others?

Maxwell and Pascal arch has different behavior? And what is the different?


Don’t mix preemption and CUDA stream priority together.

The programmer has no direct control over preemption. The programmer has direct control over stream priority. Furthermore, stream priority activity does not imply that any preemption is occurring.

The documented forms of preemption take place in

  1. Certain debug situations
  2. Servicing of certain CDP (CUDA Dynamic Parallelism) patterns

Preemption may occur in other settings (e.g. in time-sliced servicing of multiple independent clients, i.e. multiple independent processes using a single GPU without MPS) but these are basically undocumented.

Preemption the way I am using it here (i.e. in the named, documented uses) refers to the idea that a threadblock that is running may be halted and moved out of an SM, before it retires.

Stream priority does not require preemption in order to deliver its stated function, and as far as I know there is no documentation that says that stream priority mechanism will use preemption in any setting.

Stream priority says that the GPU block scheduler, when depositing blocks to be run on various SMs, will choose blocks from higher priority streams before it chooses blocks from lower priority streams. The stream priority mechanism makes no claim that I am aware of, that a block that has been deposited on a SM will be preempted (i.e. removed from that SM) to make room for another block.

The basic CUDA (threadblock) execution model is that a threadblock, once deposited on a SM, will remain on that SM until it completes execution and retires.

Don’t mix preemption with stream priority. There is no valid or documented reason to do so. A kernel’s priority is decided according to the priority of the stream it is launched into. There is no other mechanism.

I don’t know what that means, it sounds strange

I don’t know what that means, it sounds strange


Don’t mix preemption with stream priority. The CUDA runtime may preempt running kernels for certain CDP needs and in certain debug scenarios. Other uses of preemption by the CUDA runtime are undocumented AFAIK.

Don’t mix preemption with stream priority. High priority kernels do not preempt low priority kernels. Blocks from high priority kernels receive scheduling priority over blocks from low-priority kernels, but this only applies to blocks which have not yet been scheduled by the GPU block scheduler.

The observation you describe is therefore certainly possible based on the kernel launch order, and what other intervening activity there may be. If a kernel is launched into a lower-priority stream first, and sometime later a kernel is launched into a higher priority stream, blocks from the higher priority kernel will not begin to execute until the GPU block scheduler finds available space to deposit them on the SM(s). CUDA provides no guarantee of preemption or any other mechanism in such a case to guarantee that blocks from the higher priority kernel will immediately execute upon launch.

Your question is unclear, but I don’t know how to respond in either case:

Maxwell and Pascal arch have different behavior from each other? Its not documented in any way that relates to stream priority that I am aware of.

Maxwell and Pascal arch have different behavior from other architectures? Its not documented in any way that relates to stream priority that I am aware of.

Notice the principal documentation of CUDA stream priority in the CUDA programming guide:

The only runtime functional/behavioral description given is as follows:

“At runtime, as blocks in low-priority schemes finish, waiting blocks in higher-priority streams are scheduled in their place.”

This is a terse description of the functional behavior I have given. Note the use of the word “schemes” here is a typo, it should be “streams”.

Thanks for your detailed answer.
I have mixed preemption and CUDA stream priority together, thank you for your correction.

As your answer, blocks from the higher priority kernel will not begin to execute until the GPU block scheduler finds available space to deposit them on the SM(s).

Can we say the “available space” is SM resource?
If we launch kernels in the same stream, GPU block scheduler alloc “available space” for the blocks in different kernels in the same stream sequencely?

I have a profile result, and I can not explain well.Can you take a look at my profile data?
We have two thread, one thread launch kernels in high priority stream and then call cudaStreamSynchronize.
Another thread launch kernels in low priority stream an then call cudaStreamSynchronize.

Since we have two stream, low priority stream kernels execute sequencely, unless GPU block scheduler reserve “available space” for kernels in the same stream. The high priority kernel will execute quickly after the kernel launched.
But the rersult is the high priority kernel can not execute quickly after the kernel launched, it has a gap between launch time and execute start time. Can you tell me some reasons?

Attachment file is my profile result.

Thank you.


Yes. Registers, shared memory, warp slots, etc.

Kernels in the same stream cannot execute concurrently. That is due to stream semantics. Therefore the GPU block scheduler would not attempt to deal with a kernel in a particular stream, until all previous kernels (and all other previous stream activity) in that stream had completed. Yes, kernels in the same stream are always processed sequentially. That is basic stream semantics.

I just answered a question like this on stackoverflow:

  1. It should generally be unnecessary to use any synchronize calls in stream work issuance. Use the CUDA stream semantics to organize your activity.
  2. If you are working on a GPU in Windows WDDM driver mode, complex concurrency scenarios are very hard to arrange. I don’t even bother. Switch to linux or Windows TCC driver model for the GPU. (You appear to be on a Tesla P4, so this probably isn’t the case.)
  3. Note what I said in my answer there, and what I’ve said about block scheduling in this thread. If you, for example, launch a low-priority kernel that uses a grid-stride-loop and grid sizing to maximize occupancy, that could prevent a higher priority block from getting scheduled. Blocks must retire for higher priority blocks to get scheduled.

It’s not possible to say for sure just based on a profiler trace. The exact code is necessary also. Even if you provide your exact code, I’m not sure I would debug it for you. I’ve already given you a recipe, description, and fully-worked example in the answer to the stack overflow question I linked above.

The visual profiler also shows in the timeline the organization of work by stream. You’ve cut this out of the picture you posted. There might be some clues in there. Also, I haven’t tested my simple test case when work is issued by multiple threads. As a test, you might want to see if you can issue the work from the same thread. A static picture from nvvp doesn’t give me enough information.

Attachment image file is the timeline the organization of work by stream.

I notice that the kernel cuInt8::ncqhw4ToNchw launched many times, between the first ncqhw4ToNchw and the last ncqhw4ToNchw execute, the high priority kernel can not execute parallel. After last ncqhw4ToNchw launched, high priority kernel can execute parallel. I do not known if kernel ncqhw4ToNchw cause the problem.


Hi, I edit a test code as below.
Function T1Task::Proc() is used to launch kernels in two streams, low priority and high priority.
And RtDemo is a sample TensorRT context, which has only one layer implemented by TensorRT builtin Shuffle layer.
If I add

if (i == 0) {

Two stream kernels can not execute in parallel.
Because when before launch_kernel, I call rt.Infer(), and after launch_kernel, I call rt2.Infer() too.
I think two rt.Infer() prevent two stream execute concurrent.

for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      //cudaStreamWaitEvent(streams[i], event, 0);

    if (i == 0) {

    for (int k = 0; k < 10; k++) {
      launch_kernel(data[i], streams[i], 2160, 1024, N);

    if (i == 0) {

If i cut

if (i == 0) {

Two stream kernels can execute in parallel.
Attachment is my profile trace and code is as below.

#include <glog/logging.h>

#include <assert.h>

#include <sys/stat.h>
#include <time.h>
#include <cuda_runtime_api.h>
#include <float.h>
#include <string>
#include <vector>
#include <unordered_map>
#include <cmath>
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <iterator>

#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <NvInferPlugin.h>

namespace nvinfer1 {
class ICaffePoolOutputDimensionsFormula : public IOutputDimensionsFormula {
  virtual DimsHW compute(DimsHW input_dims,
                         DimsHW kernel_size,
                         DimsHW stride,
                         DimsHW padding,
                         DimsHW dilation,
                         const char *layerName)
                         override {
                         const override {
    const int kernel_extent_h = dilation.d[0] * (kernel_size.d[0] - 1) + 1;
    const int kernel_extent_w = dilation.d[1] * (kernel_size.d[1] - 1) + 1;
    auto &&h_temp = (input_dims.d[0] + 2 * padding.d[0] - kernel_extent_h) *
                    1.0 / stride.d[0];
    auto &&w_temp = (input_dims.d[1] + 2 * padding.d[1] - kernel_extent_w) *
                    1.0 / stride.d[1];

    std::string str_name(layerName);
    if (str_name.find("as_conv") == std::string::npos) {
      return DimsHW(ceil(h_temp) + 1, ceil(w_temp) + 1);
    } else {
      return DimsHW(floor(h_temp) + 1, floor(w_temp) + 1);

  ICaffePoolOutputDimensionsFormula() {}
  ~ICaffePoolOutputDimensionsFormula() {}
}  // namespace

class RtDemo {
  RtDemo() {}

  ~RtDemo() {}

  int Init();

  int Infer(cudaStream_t stream);

  nvinfer1::IExecutionContext *guard_context_ = nullptr;
  nvinfer1::IBuilder *guard_builder_ = nullptr;
  nvinfer1::INetworkDefinition *guard_network_ = nullptr;
  std::vector<void *> guard_buffers_;


#include <vector>
#include <map>
#include <set>
#include <iostream>
#include <fstream>
#include <sstream>

#include <NvInferPlugin.h>

#include "rt_demo.h"

class RTLogger : public nvinfer1::ILogger {
  void log(Severity severity, const char *msg) override {
    if (severity != Severity::kINFO) {
      std::cout << msg << std::endl;
} rt_gLogger;

int RtDemo::Init() {
  // 1. create
  guard_builder_ = nvinfer1::createInferBuilder(rt_gLogger);
  guard_network_ = guard_builder_->createNetwork();
  nvinfer1::ICaffePoolOutputDimensionsFormula poolFormula;

  // 2. add input
  std::string input_data = "data";
  //nvinfer1::DimsCHW input_dims{720, 1280, 3};
  nvinfer1::DimsCHW input_dims{600, 800, 3};
  auto data = guard_network_->addInput(input_data.c_str(),

  // add tensorrt builtin shuffle layer
  std::string permute_name = "data_perm";
  nvinfer1::Dims dims;
  dims.nbDims = 3;
  for (int i = 0; i < dims.nbDims; ++i) {
   dims.d[i] = 0;
   dims.type[i] = nvinfer1::DimensionType::kSPATIAL;
  dims.type[0] = nvinfer1::DimensionType::kCHANNEL;

  // HWC -> CHW: [2, 0, 1], CHW -> HWC: [1, 2, 0]
  nvinfer1::Permutation perm;
  perm.order[0] = 2;
  perm.order[1] = 0;
  perm.order[2] = 1;

  for (int i = dims.nbDims; i < nvinfer1::Dims::MAX_DIMS; ++i) {
    perm.order[i] = 0;

  nvinfer1::IShuffleLayer *shuffleLayer = guard_network_->addShuffle(*data);


  // 3. build engine
  int work_space_size = 1 << 30;



  std::cout << "Building cuda engine ...";

  nvinfer1::ICudaEngine *engine = nullptr;
  engine = guard_builder_->buildCudaEngine(*guard_network_);

  guard_context_ = engine->createExecutionContext();
  guard_buffers_.resize(2); // 1 input + 1 output

  void *gpu_ptr;
  cudaMalloc(&gpu_ptr, 16 << 20);
  guard_buffers_[0] = gpu_ptr;

  cudaMalloc(&gpu_ptr, 16 << 20);
  guard_buffers_[1] = gpu_ptr;
  if (guard_context_) {
    std::cout << "init guard success" << std::endl;
  return true;

int RtDemo::Infer(cudaStream_t stream) {
  guard_context_->enqueue(1, &guard_buffers_[0], stream, nullptr);
  return 0;

And the test code

#include <pthread.h>
#include <stdio.h>
#include <glog/logging.h>
#include "t1_task.h"
#include "rt_demo.h"

const int N = (1 << 20);
void *launch_kernel(float*data, cudaStream_t stream, int blocks, int threads, int n);
void *launch_kernel_default(float*data, int blocks, int threads, int n);

bool T1Task::Preprocess() {
  max_streams = 2;

  int leastPriority = 0;
  int greatestPriority = 0;
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
  LOG(INFO) << "leastPriority: " << leastPriority << ", greatestPriority: " << greatestPriority;

  priority_ = -1;
  for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      LOG(INFO) << "this set priority##############:" << priority_;
      cudaStreamCreateWithPriority(&streams[i], cudaStreamNonBlocking, priority_);//cudaStreamNonBlocking);
    }  else {
      cudaStreamCreateWithPriority(&streams[i], cudaStreamNonBlocking, 0);//cudaStreamNonBlocking);

    cudaMalloc(&data[i], N * sizeof(float));
    cudaMalloc(&data[i + max_streams], N * sizeof(float));

bool T1Task::Proc() {
  RtDemo rt;
  RtDemo rt2;

  int threads = threads_;
  for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      //cudaStreamWaitEvent(streams[i], event, 0);

    if (i == 0) {

    for (int k = 0; k < 10; k++) {
      launch_kernel(data[i], streams[i], 2160, 1024, N);

    if (i == 0) {

  for (int i = 0; i < max_streams; ++i) {
  return true;

Do you know the reasons? Thank you.

Hi @hill_brook, @Robert_Crovella,
I stumbled on this question while looking for information on CUDA stream priorities, and I have a follow up question.


On the other hand, the documentation of cudaStreamCreateWithPriority in the CUDA Runtime API states (emphasis mine)

Creates a stream with the specified priority and returns a handle in pStream. This API alters the scheduler priority of work in the stream. Work in a higher priority stream may preempt work already executing in a low priority stream.

Did the behaviour (or the documentation) change in a more recent CUDA version and/or hardware architecture ?

Or, does “preempt” mean something different here ?

Thank you for any clarification,