My understanding of cudaGraphInstantiateFlagUseNodePriority is that it prioritizes kernel launches.
For example, suppose we have 3 independent kernels in a cudaGraph, first, second and third, and each kernel waits for 1 s and prints its name. If we set the kernel graph node priorities using cudaGraphKernelNodeSetAttribute (attribute name cudaLaunchAttributePriority) to 0, 1 and 2 respectively, then when the graph is executed it should honor the priorities, i.e. third should be called, followed by second, followed by first.
Another thing to note: after setting the priority on a kernel graph node, if I try to confirm it using cudaGraphKernelNodeGetAttribute (attribute name cudaLaunchAttributePriority), I always get priority 0. It should return the same value passed to the Set call, right?
Please correct me if I am wrong.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CUDA_CHECK(error)                                                          \
  {                                                                                \
    cudaError_t localError = error;                                                \
    if (localError != cudaSuccess) {                                               \
      printf("error: '%s'(%d) from %s at %s:%d\n", cudaGetErrorString(localError), \
             localError, #error, __FUNCTION__, __LINE__);                          \
      exit(1); /* exit with a nonzero status on error */                           \
    }                                                                              \
  }
__global__ void first(uint32_t interval, const uint32_t ticks_per_ms) {
  size_t globaltid = blockIdx.x * blockDim.x + threadIdx.x;
  if (globaltid == 0) {
    printf("\nfirst..");
    while (interval--) {
      uint64_t start = clock64();
      while (clock64() - start < ticks_per_ms) {
      }
    }
    printf("first\n");
  }
}

__global__ void second(uint32_t interval, const uint32_t ticks_per_ms) {
  size_t globaltid = blockIdx.x * blockDim.x + threadIdx.x;
  if (globaltid == 0) {
    printf("\nsecond..");
    while (interval--) {
      uint64_t start = clock64();
      while (clock64() - start < ticks_per_ms) {
      }
    }
    printf("second\n");
  }
}

__global__ void third(uint32_t interval, const uint32_t ticks_per_ms) {
  size_t globaltid = blockIdx.x * blockDim.x + threadIdx.x;
  if (globaltid == 0) {
    printf("\nthird..");
    while (interval--) {
      uint64_t start = clock64();
      while (clock64() - start < ticks_per_ms) {
      }
    }
    printf("third\n");
  }
}
void cudaGraphsManual() {
  cudaStream_t streamForGraph;
  cudaGraph_t graph;
  cudaGraphNode_t kernelNode;
  CUDA_CHECK(cudaStreamCreate(&streamForGraph));
  cudaKernelNodeParams kernelNodeParams = {0};
  CUDA_CHECK(cudaGraphCreate(&graph, 0));

  // cudaDevAttrClockRate is reported in kHz, i.e. clock ticks per millisecond.
  int ticks_per_ms = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&ticks_per_ms, cudaDevAttrClockRate, 0));
  uint32_t interval = std::chrono::milliseconds(1000).count();
  void *kernelArgs[2] = {&interval, &ticks_per_ms};

  // Node 1: "first", priority 0.
  kernelNodeParams.func = (void *)first;
  kernelNodeParams.gridDim = dim3(1, 1, 1);
  kernelNodeParams.blockDim = dim3(1, 1, 1);
  kernelNodeParams.sharedMemBytes = 0;
  kernelNodeParams.kernelParams = kernelArgs;
  kernelNodeParams.extra = NULL;
  CUDA_CHECK(cudaGraphAddKernelNode(&kernelNode, graph, NULL, 0, &kernelNodeParams));
  union cudaKernelNodeAttrValue p1;
  p1.priority = 0;
  CUDA_CHECK(cudaGraphKernelNodeSetAttribute(kernelNode, cudaLaunchAttributePriority, &p1));
  union cudaKernelNodeAttrValue p4;
  CUDA_CHECK(cudaGraphKernelNodeGetAttribute(kernelNode, cudaLaunchAttributePriority, &p4));
  printf("\nPriority : %d\n", p4.priority);

  // Node 2: "second", priority 2.
  kernelNodeParams.func = (void *)second;
  CUDA_CHECK(cudaGraphAddKernelNode(&kernelNode, graph, NULL, 0, &kernelNodeParams));
  union cudaKernelNodeAttrValue p2;
  p2.priority = 2;
  CUDA_CHECK(cudaGraphKernelNodeSetAttribute(kernelNode, cudaLaunchAttributePriority, &p2));
  CUDA_CHECK(cudaGraphKernelNodeGetAttribute(kernelNode, cudaLaunchAttributePriority, &p4));
  printf("\nPriority : %d\n", p4.priority);

  // Node 3: "third", priority 1.
  kernelNodeParams.func = (void *)third;
  CUDA_CHECK(cudaGraphAddKernelNode(&kernelNode, graph, NULL, 0, &kernelNodeParams));
  union cudaKernelNodeAttrValue p3;
  p3.priority = 1;
  CUDA_CHECK(cudaGraphKernelNodeSetAttribute(kernelNode, cudaLaunchAttributePriority, &p3));
  CUDA_CHECK(cudaGraphKernelNodeGetAttribute(kernelNode, cudaLaunchAttributePriority, &p4));
  printf("\nPriority : %d\n", p4.priority);

  cudaGraphExec_t graphExec;
  CUDA_CHECK(cudaGraphInstantiateWithFlags(&graphExec, graph, cudaGraphInstantiateFlagUseNodePriority));
  CUDA_CHECK(cudaGraphLaunch(graphExec, streamForGraph));
  CUDA_CHECK(cudaStreamSynchronize(streamForGraph));
  CUDA_CHECK(cudaGraphExecDestroy(graphExec));
  CUDA_CHECK(cudaGraphDestroy(graph));
  CUDA_CHECK(cudaStreamDestroy(streamForGraph));
}
int main(int argc, char **argv) {
  cudaGraphsManual();
  return EXIT_SUCCESS;
}
When posting code here, please format your code properly. A simple method is to edit your post by clicking the pencil icon underneath it, then select the code in your post, then click the </> button at the top of the edit pane, then save your changes.
You seem to have ignored the advice given to you by striker159 here. When I change one of your requested priorities to -1, I get that reflected in the output.
Regarding the rest, I’m not really sure what your expectations are. Your kernels are clearly capable of running concurrently. You haven’t established any dependencies, so my expectation is that those kernels would run concurrently. In any event CUDA doesn’t guarantee any execution order of such kernels, with or without priority. They can all execute, so I’m not sure what you are expecting from the priority statements.
I haven’t studied the topic closely, but my expectation here would basically be an analog of CUDA stream priorities. If I launch 3 kernels with 1 threadblock each, stream priorities won’t prevent or order their execution in any way.
If your desired goal is that node priority causes the higher-priority node to execute to completion before the lower-priority node begins, you are mistaken; that is not what node priority does, and your test case proves it. You may want to study CUDA stream priorities. There are numerous questions about them on these forums, as well as a section on the topic in the programming guide.
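As a minimal sketch of the stream priority mechanism (busyWork is a placeholder kernel, error checking omitted; numerically lower values are higher priority):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyWork() {} // placeholder kernel

int main() {
  // Query the range of meaningful stream priorities on this device.
  // Numerically lower values are higher priority (e.g. -1 outranks 0).
  int leastPriority = 0, greatestPriority = 0;
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
  printf("priority range: %d (least) .. %d (greatest)\n", leastPriority, greatestPriority);

  cudaStream_t lowStream, highStream;
  cudaStreamCreateWithPriority(&lowStream, cudaStreamNonBlocking, leastPriority);
  cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking, greatestPriority);

  // Priority is a scheduling hint for pending work. It does not preempt
  // already-running blocks, and it does not impose an execution order on
  // kernels that are able to run concurrently.
  busyWork<<<1, 1, 0, lowStream>>>();
  busyWork<<<1, 1, 0, highStream>>>();
  cudaDeviceSynchronize();

  cudaStreamDestroy(lowStream);
  cudaStreamDestroy(highStream);
  return 0;
}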
Thank you @Robert_Crovella. I have gone through some articles on CUDA stream priority; however, to my understanding there are 2 priorities involved with a cudaGraph:
1/ Stream priority
2/ Per node priority
And per the CUDA doc for cudaGraphInstantiateFlagUseNodePriority:
Run the graph using the per-node priority attributes rather than the priority of the stream it is launched into.
Does it not mean that the kernel graph node priority applies on top of the priority of the stream into which the graph is launched?
Is there a CUDA sample/article on cudaGraphInstantiateFlagUseNodePriority to help understand it better? I could not gather much about it.
Thank you!
"rather than" means "instead of", not "on top of", in my opinion.
Node priority is analogous to stream priority; I didn't say they are identical in every way. If you use node priority, you should expect that the node priority will apply, "rather than" the stream priority.
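Roughly, as a sketch (reusing the graph built in your code above; variable names are illustrative and error checking is omitted):

#include <cuda_runtime.h>

// With cudaGraphInstantiateFlagUseNodePriority, the per-node priorities set
// via cudaGraphKernelNodeSetAttribute are used and the launch stream's
// priority is ignored for the graph's kernel nodes ("rather than"). Without
// the flag, the kernel nodes take the priority of the launch stream instead.
void launchBothWays(cudaGraph_t graph) {
  cudaGraphExec_t execNodePrio, execStreamPrio;
  cudaGraphInstantiateWithFlags(&execNodePrio, graph,
                                cudaGraphInstantiateFlagUseNodePriority);
  cudaGraphInstantiateWithFlags(&execStreamPrio, graph, 0);

  // Launch stream created at the device's greatest (numerically lowest) priority.
  int least = 0, greatest = 0;
  cudaDeviceGetStreamPriorityRange(&least, &greatest);
  cudaStream_t launchStream;
  cudaStreamCreateWithPriority(&launchStream, cudaStreamNonBlocking, greatest);

  cudaGraphLaunch(execNodePrio, launchStream);   // node priorities apply
  cudaGraphLaunch(execStreamPrio, launchStream); // stream priority applies
  cudaStreamSynchronize(launchStream);

  cudaGraphExecDestroy(execNodePrio);
  cudaGraphExecDestroy(execStreamPrio);
  cudaStreamDestroy(launchStream);
}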
Each graph is launched into a stream. But this by itself is not enough to fully inform us about graph behavior: due to the dependency mechanism in graphs, if we specify two graph kernel nodes that are both dependent on another node, but have no other dependencies, those two kernel nodes have the potential to execute concurrently. That would not be the case for two kernels launched into the same stream that the graph is launched into. Therefore CUDA graphs introduce some kind of mechanism (not unlike streams) to allow for this type of concurrency, as well as to support the dependency mechanisms of CUDA graphs. This concurrency mechanism corresponds to one of our two canonical CUDA stream semantics statements:
“items launched into separate created streams have no ordering imposed by CUDA”
whereas the dependency mechanism has an analog in the other canonical CUDA stream semantics statement:
“items launched into the same stream will execute in issue order”
So the node priority mechanism introduces something like (analogous to) CUDA stream priorities into the midst of a CUDA graph: in effect a "new stream mechanism" inside the graph that supports concurrency of nodes (as well as dependency), even though at a global level, like the rest of the graph, those nodes are launched into the same stream.
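To make the concurrency point concrete, a minimal sketch (parent, childA and childB are placeholder kernels; error checking omitted): both children depend only on the parent node, so once parent completes they are free to execute concurrently, which would not be the case if the same three kernels were issued back-to-back into a single stream.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void parent() { printf("parent\n"); }
__global__ void childA() { printf("childA\n"); }
__global__ void childB() { printf("childB\n"); }

int main() {
  cudaGraph_t g;
  cudaGraphCreate(&g, 0);

  cudaKernelNodeParams p = {0};
  p.gridDim = dim3(1, 1, 1);
  p.blockDim = dim3(1, 1, 1);
  p.sharedMemBytes = 0;
  p.kernelParams = nullptr; // kernels take no arguments
  p.extra = nullptr;

  // parent has no dependencies.
  cudaGraphNode_t parentNode, a, b;
  p.func = (void *)parent;
  cudaGraphAddKernelNode(&parentNode, g, nullptr, 0, &p);

  // childA and childB each depend only on parentNode, so once parent
  // finishes the graph allows them to execute concurrently -- unlike
  // kernels issued in order into one stream.
  p.func = (void *)childA;
  cudaGraphAddKernelNode(&a, g, &parentNode, 1, &p);
  p.func = (void *)childB;
  cudaGraphAddKernelNode(&b, g, &parentNode, 1, &p);

  cudaGraphExec_t e;
  cudaGraphInstantiateWithFlags(&e, g, 0);
  cudaGraphLaunch(e, 0); // default stream
  cudaDeviceSynchronize();

  cudaGraphExecDestroy(e);
  cudaGraphDestroy(g);
  return 0;
}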
No, I don’t have articles or samples to suggest specifically for graph node priority.