How does high priority stream preemption work?

Hello, I have some questions about stream priority. Can I get answers to the questions below?

How is a kernel’s priority decided for preemption?
I found some CUDA preemption example codes in which people used CUDA stream priority to test preemption. Does the CUDA stream priority alone decide the kernel’s priority, or are there other mechanisms in the CUDA runtime that decide which kernel has the higher priority in preemption?
In my practice, kernels in a high priority stream may not preempt kernels in a low priority stream immediately. Does high priority preemption have other conditions, such as a scheduling slice or something else?

Do the Maxwell and Pascal architectures have different behavior? If so, what is the difference?

Thanks.

Don’t mix preemption and CUDA stream priority together.

The programmer has no direct control over preemption. The programmer has direct control over stream priority. Furthermore, stream priority activity does not imply that any preemption is occurring.

The documented forms of preemption take place in

  1. Certain debug situations https://docs.nvidia.com/cuda/cuda-gdb/index.html#single-gpu-debugging-with-desktop-manager-running
  2. Servicing of certain CDP (CUDA Dynamic Parallelism) patterns https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#sm-id-and-warp-id

Preemption may occur in other settings (e.g. in time-sliced servicing of multiple independent clients, i.e. multiple independent processes using a single GPU without MPS) but these are basically undocumented.

Preemption the way I am using it here (i.e. in the named, documented uses) refers to the idea that a threadblock that is running may be halted and moved out of an SM, before it retires.

Stream priority does not require preemption in order to deliver its stated function, and as far as I know there is no documentation that says that stream priority mechanism will use preemption in any setting.

(Yes, as noted below, the documentation now states that preemption may be used. may is not the same as will.)

Stream priority says that the GPU block scheduler, when depositing blocks to be run on various SMs, will choose blocks from higher priority streams before it chooses blocks from lower priority streams. The stream priority mechanism makes no claim that I am aware of, that a block that has been deposited on a SM will be preempted (i.e. removed from that SM) to make room for another block.

The basic CUDA (threadblock) execution model is that a threadblock, once deposited on a SM, will remain on that SM until it completes execution and retires.

Don’t mix preemption with stream priority. There is no valid or documented reason to do so. A kernel’s priority is decided according to the priority of the stream it is launched into. There is no other mechanism.
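For illustration, here is a minimal sketch of what this looks like in code (not from the original post; dummy_kernel and the launch sizes are placeholders). The two kernels are identical; only the stream they are launched into determines their priority:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *d, int n) {
  // grid-stride loop; placeholder work
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    d[i] *= 2.0f;
}

int main() {
  int least = 0, greatest = 0;
  // query the priority range supported by the device
  // (numerically lower values mean higher priority)
  cudaDeviceGetStreamPriorityRange(&least, &greatest);

  cudaStream_t lo_stream, hi_stream;
  cudaStreamCreateWithPriority(&lo_stream, cudaStreamNonBlocking, least);
  cudaStreamCreateWithPriority(&hi_stream, cudaStreamNonBlocking, greatest);

  const int N = 1 << 20;
  float *d_lo, *d_hi;
  cudaMalloc(&d_lo, N * sizeof(float));
  cudaMalloc(&d_hi, N * sizeof(float));

  // identical kernels; the stream alone decides the priority
  dummy_kernel<<<80, 256, 0, lo_stream>>>(d_lo, N);
  dummy_kernel<<<80, 256, 0, hi_stream>>>(d_hi, N);

  cudaDeviceSynchronize();
  printf("priority range: least=%d greatest=%d\n", least, greatest);
  return 0;
}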

I don’t know what that means, it sounds strange

I don’t know what that means, it sounds strange

yes

Don’t mix preemption with stream priority. The CUDA runtime may preempt running kernels for certain CDP needs and in certain debug scenarios. Other uses of preemption by the CUDA runtime are undocumented AFAIK.

Don’t mix preemption with stream priority. High priority kernels will not necessarily preempt low priority kernels. Blocks from high priority kernels receive scheduling priority over blocks from low-priority kernels, but this only applies to blocks which have not yet been scheduled by the GPU block scheduler. Although the CUDA runtime may preempt low priority kernel blocks for high priority kernel blocks, there are no stated conditions under which such behavior is guaranteed, AFAIK.

The observation you describe is therefore certainly possible based on the kernel launch order, and what other intervening activity there may be. If a kernel is launched into a lower-priority stream first, and sometime later a kernel is launched into a higher priority stream, blocks from the higher priority kernel will not begin to execute until the GPU block scheduler finds available space to deposit them on the SM(s). CUDA provides no guarantee of preemption or any other mechanism in such a case to guarantee that blocks from the higher priority kernel will immediately execute upon launch.

Your question is unclear, but I don’t know how to respond in either case:

Do the Maxwell and Pascal architectures have different behavior from each other? It’s not documented in any way that relates to stream priority that I am aware of.

Do the Maxwell and Pascal architectures have different behavior from other architectures? It’s not documented in any way that relates to stream priority that I am aware of.

Notice the principal documentation of CUDA stream priority in the CUDA programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities

The only runtime functional/behavioral description given is as follows:

“At runtime, as blocks in low-priority schemes finish, waiting blocks in higher-priority streams are scheduled in their place.”

This is a terse description of the functional behavior I have given. Note the use of the word “schemes” here is a typo, it should be “streams”.


Thanks for your detailed answer.
I had mixed up preemption and CUDA stream priority; thank you for your correction.

As you answered, blocks from the higher priority kernel will not begin to execute until the GPU block scheduler finds available space to deposit them on the SM(s).

Can we say the “available space” is SM resources?
If we launch kernels in the same stream, does the GPU block scheduler allocate “available space” for the blocks of the different kernels in that stream sequentially?

I have a profile result that I cannot explain well. Can you take a look at my profile data?
We have two threads; one thread launches kernels in a high priority stream and then calls cudaStreamSynchronize.
The other thread launches kernels in a low priority stream and then calls cudaStreamSynchronize.

Since we have two streams, the low priority stream kernels execute sequentially, unless the GPU block scheduler reserves “available space” for kernels in the same stream. The high priority kernel should execute quickly after it is launched.
But the result is that the high priority kernel does not execute quickly after it is launched; there is a gap between the launch time and the execution start time. Can you tell me some possible reasons?

The attached file is my profile result.

Thank you.

Yes. Registers, shared memory, warp slots, etc.
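To make “available space” concrete, here is a minimal sketch (not from the original reply; example_kernel is a placeholder) that asks the occupancy API how many blocks of a kernel can be resident on one SM. The limit comes from exactly those resources:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void example_kernel(float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int blocks_per_sm = 0;
  // how many 256-thread blocks of this kernel fit on one SM, given its
  // register, shared memory, and warp-slot usage
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, example_kernel, 256, 0);

  printf("SMs: %d, resident blocks per SM: %d, total resident blocks: %d\n",
         prop.multiProcessorCount, blocks_per_sm,
         prop.multiProcessorCount * blocks_per_sm);
  // once this many blocks are resident, further blocks (from any stream,
  // at any priority) must wait for earlier blocks to retire
  return 0;
}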

Kernels in the same stream cannot execute concurrently. That is due to stream semantics. Therefore the GPU block scheduler would not attempt to deal with a kernel in a particular stream, until all previous kernels (and all other previous stream activity) in that stream had completed. Yes, kernels in the same stream are always processed sequentially. That is basic stream semantics.
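As an illustration of those stream semantics, here is a minimal sketch (illustrative only; busy_kernel is a placeholder):

#include <cuda_runtime.h>

__global__ void busy_kernel(float *d, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    d[i] += 1.0f;
}

int main() {
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  const int N = 1 << 22;
  float *a, *b;
  cudaMalloc(&a, N * sizeof(float));
  cudaMalloc(&b, N * sizeof(float));

  busy_kernel<<<32, 256, 0, s1>>>(a, N);  // first kernel in s1
  busy_kernel<<<32, 256, 0, s1>>>(a, N);  // same stream: the block scheduler will not
                                          // consider this kernel until the first completes
  busy_kernel<<<32, 256, 0, s2>>>(b, N);  // different stream: may overlap with either s1 kernel

  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  return 0;
}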

I just answered a question like this on stackoverflow:

https://stackoverflow.com/questions/57061559/concurrency-of-one-large-kernel-with-many-small-kernels-and-memcopys-cuda

  1. It should generally be unnecessary to use any synchronize calls in stream work issuance. Use the CUDA stream semantics to organize your activity.
  2. If you are working on a GPU in Windows WDDM driver mode, complex concurrency scenarios are very hard to arrange. I don’t even bother. Switch to linux or Windows TCC driver model for the GPU. (You appear to be on a Tesla P4, so this probably isn’t the case.)
  3. Note what I said in my answer there, and what I’ve said about block scheduling in this thread. If you, for example, launch a low-priority kernel that uses a grid-stride-loop and grid sizing to maximize occupancy, that could prevent a higher priority block from getting scheduled. Blocks must retire for higher priority blocks to get scheduled (see the sketch after this list).
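Here is a minimal sketch of that grid-sizing idea (illustrative only, not your code; gsl_kernel and the sizes are placeholders). The grid-stride loop lets the same kernel do the full amount of work whether its grid fills the machine or deliberately leaves room for higher-priority blocks:

#include <cuda_runtime.h>

__global__ void gsl_kernel(float *d, int n) {
  // grid-stride loop: processes all n elements regardless of grid size
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    d[i] += 1.0f;
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  const int N = 1 << 24;
  float *d;
  cudaMalloc(&d, N * sizeof(float));

  // "occupancy-maximizing" grid: roughly fills every SM (assuming a
  // 2048-thread-per-SM limit), so no space is left for blocks from a
  // higher-priority stream until some blocks retire
  int full_grid = prop.multiProcessorCount * (2048 / 256);

  // smaller grid: deliberately leaves SM resources free, so high-priority
  // blocks can be deposited without waiting for retirements
  int small_grid = prop.multiProcessorCount;

  gsl_kernel<<<full_grid, 256>>>(d, N);   // may monopolize the SMs
  gsl_kernel<<<small_grid, 256>>>(d, N);  // leaves headroom

  cudaDeviceSynchronize();
  return 0;
}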

It’s not possible to say for sure just based on a profiler trace. The exact code is necessary also. Even if you provide your exact code, I’m not sure I would debug it for you. I’ve already given you a recipe, description, and fully-worked example in the answer to the stack overflow question I linked above.

The visual profiler also shows in the timeline the organization of work by stream. You’ve cut this out of the picture you posted. There might be some clues in there. Also, I haven’t tested my simple test case when work is issued by multiple threads. As a test, you might want to see if you can issue the work from the same thread. A static picture from nvvp doesn’t give me enough information.

The attached image file is the timeline showing the organization of work by stream.

I notice that the kernel cuInt8::ncqhw4ToNchw is launched many times. Between the execution of the first ncqhw4ToNchw and the last ncqhw4ToNchw, the high priority kernel cannot execute in parallel. After the last ncqhw4ToNchw is launched, the high priority kernel can execute in parallel. I do not know if the kernel ncqhw4ToNchw is causing the problem.

Thanks.

Hi, I wrote a test code, shown below.
The function T1Task::Proc() is used to launch kernels into two streams, one low priority and one high priority.
RtDemo is a simple TensorRT context, which has only one layer, implemented with the TensorRT builtin Shuffle layer.
If I add

if (i == 0) {
  rt2.Infer(streams[0]);
}

then the kernels in the two streams cannot execute in parallel.
This is because I call rt.Infer() before the launch_kernel loop, and I also call rt2.Infer() after it.
I think the two Infer() calls prevent the two streams from executing concurrently.

for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      //cudaStreamWaitEvent(streams[i], event, 0);
    }

    if (i == 0) {
      rt.Infer(streams[0]);
    }

    for (int k = 0; k < 10; k++) {
      launch_kernel(data[i], streams[i], 2160, 1024, N);
    }

    if (i == 0) {
      rt2.Infer(streams[0]);
    }
  }

If I comment it out,

if (i == 0) {
  //rt2.Infer(streams[0]);
}

then the kernels in the two streams can execute in parallel.
The attachment is my profile trace, and the code is below.
rt_demo.h

#include <glog/logging.h>

#include <assert.h>

#include <sys/stat.h>
#include <time.h>
#include <cuda_runtime_api.h>
#include <float.h>
#include <string>
#include <vector>
#include <unordered_map>
#include <cmath>
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <iterator>

#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <NvInferPlugin.h>

namespace nvinfer1 {
class ICaffePoolOutputDimensionsFormula : public IOutputDimensionsFormula {
 public:
  virtual DimsHW compute(DimsHW input_dims,
                         DimsHW kernel_size,
                         DimsHW stride,
                         DimsHW padding,
                         DimsHW dilation,
                         const char *layerName)
#if NV_TENSORRT_MAJOR < 5
                         override {
#else
                         const override {
#endif
    const int kernel_extent_h = dilation.d[0] * (kernel_size.d[0] - 1) + 1;
    const int kernel_extent_w = dilation.d[1] * (kernel_size.d[1] - 1) + 1;
    auto &&h_temp = (input_dims.d[0] + 2 * padding.d[0] - kernel_extent_h) *
                    1.0 / stride.d[0];
    auto &&w_temp = (input_dims.d[1] + 2 * padding.d[1] - kernel_extent_w) *
                    1.0 / stride.d[1];

    std::string str_name(layerName);
    if (str_name.find("as_conv") == std::string::npos) {
      return DimsHW(ceil(h_temp) + 1, ceil(w_temp) + 1);
    } else {
      return DimsHW(floor(h_temp) + 1, floor(w_temp) + 1);
    }
  }

  ICaffePoolOutputDimensionsFormula() {}
  ~ICaffePoolOutputDimensionsFormula() {}
};
}  // namespace

class RtDemo {
 public:
  RtDemo() {}

  ~RtDemo() {}

  int Init();

  int Infer(cudaStream_t stream);

 private:
  nvinfer1::IExecutionContext *guard_context_ = nullptr;
  nvinfer1::IBuilder *guard_builder_ = nullptr;
  nvinfer1::INetworkDefinition *guard_network_ = nullptr;
  std::vector<void *> guard_buffers_;
};

rt_demo.cpp

#include <vector>
#include <map>
#include <set>
#include <iostream>
#include <fstream>
#include <sstream>

#include <NvInferPlugin.h>

#include "rt_demo.h"

class RTLogger : public nvinfer1::ILogger {
  void log(Severity severity, const char *msg) override {
    if (severity != Severity::kINFO) {
      std::cout << msg << std::endl;
    }
  }
} rt_gLogger;

int RtDemo::Init() {
  // 1. create
  guard_builder_ = nvinfer1::createInferBuilder(rt_gLogger);
  guard_network_ = guard_builder_->createNetwork();
  nvinfer1::ICaffePoolOutputDimensionsFormula poolFormula;
  guard_network_->setPoolingOutputDimensionsFormula(&poolFormula);

  // 2. add input
  std::string input_data = "data";
  //nvinfer1::DimsCHW input_dims{720, 1280, 3};
  nvinfer1::DimsCHW input_dims{600, 800, 3};
  auto data = guard_network_->addInput(input_data.c_str(),
                                   nvinfer1::DataType::kFLOAT,
                                   input_dims);
  data->setName(input_data.c_str());

  // add tensorrt builtin shuffle layer
  std::string permute_name = "data_perm";
  nvinfer1::Dims dims;
  dims.nbDims = 3;
  for (int i = 0; i < dims.nbDims; ++i) {
   dims.d[i] = 0;
   dims.type[i] = nvinfer1::DimensionType::kSPATIAL;
  }
  dims.type[0] = nvinfer1::DimensionType::kCHANNEL;

  // HWC -> CHW: [2, 0, 1], CHW -> HWC: [1, 2, 0]
  nvinfer1::Permutation perm;
  perm.order[0] = 2;
  perm.order[1] = 0;
  perm.order[2] = 1;

  for (int i = dims.nbDims; i < nvinfer1::Dims::MAX_DIMS; ++i) {
    perm.order[i] = 0;
  }

  nvinfer1::IShuffleLayer *shuffleLayer = guard_network_->addShuffle(*data);
  shuffleLayer->setFirstTranspose(perm);
  shuffleLayer->setName(permute_name.c_str());
  shuffleLayer->setReshapeDimensions(dims);
  shuffleLayer->getOutput(0)->setName(permute_name.c_str());

  guard_network_->markOutput(*shuffleLayer->getOutput(0));

  // 3. build engine
  guard_builder_->setMaxBatchSize(1);
  int work_space_size = 1 << 30;
  guard_builder_->setMaxWorkspaceSize(work_space_size);

  guard_builder_->setInt8Mode(false);
  guard_builder_->setInt8Calibrator(nullptr);

  guard_builder_->setDebugSync(true);

  std::cout << "Building cuda engine ...";

  nvinfer1::ICudaEngine *engine = nullptr;
  engine = guard_builder_->buildCudaEngine(*guard_network_);

  guard_context_ = engine->createExecutionContext();
  guard_buffers_.resize(2); // 1 input + 1 output

  void *gpu_ptr;
  cudaMalloc(&gpu_ptr, 16 << 20);
  guard_buffers_[0] = gpu_ptr;

  cudaMalloc(&gpu_ptr, 16 << 20);
  guard_buffers_[1] = gpu_ptr;
  if (guard_context_) {
    std::cout << "init guard success" << std::endl;
  }
  return true;
}

int RtDemo::Infer(cudaStream_t stream) {
  guard_context_->enqueue(1, &guard_buffers_[0], stream, nullptr);
  return 0;
}

And the test code

#include <pthread.h>
#include <stdio.h>
#include <glog/logging.h>
#include "t1_task.h"
#include "rt_demo.h"

const int N = (1 << 20);
void *launch_kernel(float*data, cudaStream_t stream, int blocks, int threads, int n);
void *launch_kernel_default(float*data, int blocks, int threads, int n);

bool T1Task::Preprocess() {
  cudaSetDevice(0);
  max_streams = 2;

  int leastPriority = 0;
  int greatestPriority = 0;
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
  LOG(INFO) << "leastPriority: " << leastPriority << ", greatestPriority: " << greatestPriority;

  priority_ = -1;
  for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      LOG(INFO) << "this set priority##############:" << priority_;
      cudaStreamCreateWithPriority(&streams[i], cudaStreamNonBlocking, priority_);
    } else {
      cudaStreamCreateWithPriority(&streams[i], cudaStreamNonBlocking, 0);
    }

    cudaMalloc(&data[i], N * sizeof(float));
    cudaMalloc(&data[i + max_streams], N * sizeof(float));
  }
  return true;
}

bool T1Task::Proc() {
  RtDemo rt;
  RtDemo rt2;
  rt.Init();
  rt2.Init();

  int threads = threads_;
  for (int i = 0; i < max_streams; ++i) {
    if (i == 1) {
      //cudaStreamWaitEvent(streams[i], event, 0);
    }

    if (i == 0) {
      rt.Infer(streams[0]);
    }

    for (int k = 0; k < 10; k++) {
      launch_kernel(data[i], streams[i], 2160, 1024, N);
    }

    if (i == 0) {
      //rt2.Infer(streams[0]);
    }
  }

  for (int i = 0; i < max_streams; ++i) {
    cudaStreamSynchronize(streams[i]);
  }
  return true;
}

Do you know the reasons? Thank you.


Hi @hill_brook, @Robert_Crovella,
I stumbled on this question while looking for information on CUDA stream priorities, and I have a follow-up question.

The answers above state that stream priority does not imply preemption, and that high priority kernels will not necessarily preempt low priority kernels.

On the other hand, the documentation of cudaStreamCreateWithPriority in the CUDA Runtime API states (emphasis mine)

Description
Creates a stream with the specified priority and returns a handle in pStream. This API alters the scheduler priority of work in the stream. Work in a higher priority stream may preempt work already executing in a low priority stream.

Did the behaviour (or the documentation) change in a more recent CUDA version and/or hardware architecture?

Or, does “preempt” mean something different here?

Thank you for any clarification,
.Andrea

Same problem here regarding the preempt mentioned in the CUDA API doc.

Which is? Please clarify.

The description of cudaStreamCreateWithPriority in CUDA docs says

Work in a higher priority stream may preempt work already executing in a low priority stream.

using the word “preempt”. I am wondering if there has been a behavior change or if “preempt” is misused here.

Sorry, I am not following.

(1) What indications are there that “preempt” may be used incorrectly here?
(2) What indications are there that a behavior change occurred (and if so, over which period of time)?

Going back to an earlier post in this thread (and emphasizing what I believe are the operative words):

The programmer has no direct control over preemption […] Work in a higher priority stream may preempt work already executing in a low priority stream

That seems correct and non-contradicting to me.

I’ve defined my usage of the word preemption already in this thread. The CUDA runtime is free to preempt any code, at any time, for any reason. There are no residency guarantees.

As already indicated, the CUDA docs state that the runtime may use preemption along with stream priority. I don’t think that statement really conflicts with anything I’ve said already. Mentally, I separate preemption and stream priority. I don’t think it’s good to conflate the two.

CUDA may use preemption with stream priority. As far as I know, the usage of stream priority does not guarantee that anything will be preempted. That doesn’t rule out the possibility of preemption, either.

Thanks to @Robert_Crovella and @njuffa for the clarification.

My own question is whether the CUDA runtime could preempt a running kernel from a low-priority stream in favor of a newly launched kernel from a high-priority stream.

The original confusion of my own came from the original version of your reply:

Blocks from high priority kernels receive scheduling priority over blocks from low-priority kernels, but this only applies to blocks which have not yet been scheduled by the GPU block scheduler.

and

High priority kernels do not preempt low priority kernels.

From that, I understood that stream priority only implies a priority in the GPU block scheduler before kernel execution, which is not consistent with the API docs.

According to your update and clarification, I now understand that it is possible (though not guaranteed) for the CUDA runtime to preempt an already executing kernel due to stream priority or other reasons.