[VisionWorks] Parallel execution of nodes (or graphs)

StefanoAldegheri · February 2, 2018, 3:06pm

Hello everybody,
I am trying to get VisionWorks to execute multiple nodes at the same time. I am developing on a Jetson TX2 board updated with the latest JetPack, 3.2, so I am using VisionWorks 1.6.

From the VisionWorks reference specifications, under the “OpenVX Design Overview”, I read the following statement:
“The vxMagnitudeNode and vxPhaseNode are independently computed, in that each does not depend on the output of the other. OpenVX does not mandate that they are run simultaneously or in parallel, but it could be implemented this way by the OpenVX vendor.”

I’m trying to use both the CPU and the GPU simultaneously in my vision application to maximize performance. In my tests, I found that the nodes in an OpenVX graph are always executed serially, even when they could be executed in parallel.

To test this behaviour, I use this simple graph (not a real application; I just want to test independent node execution). It assumes that a “big_image.jpg” is present; otherwise you can launch it with the -s parameter to supply your own image.
First, it launches a single graph with each possible CPU/GPU combination for the Gaussian and Median filters. After that, a copy of the first graph is made and the two graphs are executed asynchronously. However, neither approach used the CPU and the GPU at the same time.

I attach the code

#include <cstdio>
#include <cstdlib>
#include <string>

#include <NVX/Application.hpp>
#include <NVX/Utility.hpp>
#include <OVX/UtilityOVX.hpp>
#include <OVX/FrameSourceOVX.hpp>
#include <OVX/RenderOVX.hpp>

int main(int argc, char *argv[])
{
	vx_context context = vxCreateContext();
	nvxio::Application &app = nvxio::Application::get();
	
	std::string sourceUri = "big_image.jpg";
	app.setDescription("Example application node independence");
	app.addOption('s', "source", "Source URI", nvxio::OptionHandler::string(&sourceUri));
	app.init(argc, argv);
	
	vx_image source = ovxio::loadImageFromFile(context, sourceUri, VX_DF_IMAGE_RGBX);
	
	vx_graph graph1 = vxCreateGraph(context);
	
	// Virtual images must be created on the graph that owns them (graph1).
	vx_image tmp_image1   = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_U8);
	vx_image out_gauss1   = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_VIRT);
	vx_image out_filter1  = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_VIRT);
	
	vx_node extract_node1 = vxColorConvertNode(graph1, source, tmp_image1);
	vx_node gauss_node1   = vxGaussian3x3Node(graph1, tmp_image1, out_gauss1);
	vx_node median_node1  = vxMedian3x3Node(graph1, tmp_image1, out_filter1);
	
	for(int i = 0; i < 4; i++)
	{
		// all CPU/GPU combinations: i selects the target of each filter node
		switch(i)
		{
			case 0: case 1: vxSetNodeTarget(gauss_node1, NVX_TARGET_CPU, NULL); break;
			case 2: case 3: vxSetNodeTarget(gauss_node1, NVX_TARGET_GPU, NULL); break;
		}
		switch(i)
		{
			case 0: case 2: vxSetNodeTarget(median_node1, NVX_TARGET_CPU, NULL); break;
			case 1: case 3: vxSetNodeTarget(median_node1, NVX_TARGET_GPU, NULL); break;
		}
	
		if (vxVerifyGraph(graph1) != VX_SUCCESS)
		{
			printf("Graph verification failed, see [NVX LOG] for details\n");
			fflush(stdout);
			exit(1);
		}
	
		vxProcessGraph(graph1);
	}
	
	// create the SAME graph but with different references!
	vx_graph graph2 = vxCreateGraph(context);
	
	vx_image tmp_image2   = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_U8);
	vx_image out_gauss2   = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_VIRT);
	vx_image out_filter2  = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_VIRT);
	
	vx_node extract_node2 = vxColorConvertNode(graph2, source, tmp_image2);
	vx_node gauss_node2   = vxGaussian3x3Node(graph2, tmp_image2, out_gauss2);
	vx_node median_node2  = vxMedian3x3Node(graph2, tmp_image2, out_filter2);
	
	vxSetNodeTarget(gauss_node1,  NVX_TARGET_GPU, NULL);
	vxSetNodeTarget(median_node1, NVX_TARGET_CPU, NULL);
	vxSetNodeTarget(gauss_node2,  NVX_TARGET_CPU, NULL);
	vxSetNodeTarget(median_node2, NVX_TARGET_GPU, NULL);
	
	if (vxVerifyGraph(graph1) != VX_SUCCESS)
	{
		printf("Graph verification failed, see [NVX LOG] for details\n");
		fflush(stdout);
		exit(1);
	}
	if (vxVerifyGraph(graph2) != VX_SUCCESS)
	{
		printf("Graph verification failed, see [NVX LOG] for details\n");
		fflush(stdout);
		exit(1);
	}
	
	vxScheduleGraph(graph1);
	vxScheduleGraph(graph2);
	vxWaitGraph(graph1);
	vxWaitGraph(graph2);

	vxReleaseContext(&context);
}

In a standard Jetson TX2 configuration, this compiles with the following command:

nvcc -std=c++11 test_node.cpp -I /usr/share/visionworks/sources/nvxio/include/ -L/usr/share/visionworks/sources/libs/aarch64/linux/release/ -L /usr/local/cuda-8.0/targets/aarch64-linux/lib/ -lcudart -lnvx -lovx -lvisionworks `pkg-config --libs gstreamer-base-1.0 gstreamer-pbutils-1.0 gstreamer-app-1.0 glfw3` -o test_node

The graph looks like this

So it should be possible for “Gaussian Filter” and “Median Filter” to execute in parallel, because they are independent of each other. To get a file that can be imported into the NVIDIA Visual Profiler, I run

export NVX_PROF=nvtx
nvprof --out-profile profile.log ./test_node

And the results confirm that the nodes in the graph are executed serially

and different graphs are executed serially w.r.t. each other

So here is the question: is it possible to combine CPU and GPU processing power, for example by running the “Median node” on the CPU at the same time as the “Gaussian node” is processed on the GPU (or vice versa)?

Thank you in advance for the answers!

asgikling · July 12, 2018, 4:54pm

Stefano, Nvidia,

Bump! Did you ever figure this out?

I have a similar question I can’t seem to find the answer to on this forum or in the docs (or maybe it’s really obvious so nobody has the issue, or I haven’t looked hard enough).

Say I have to analyze images every 10ms off of a 100Hz framerate camera. Say my single graph of machine vision I’m running typically takes 25ms from my static image testing. Obviously we are going to fall behind but I’m not using very many of the CUDA cores available. So I should be able to parallel process this and potentially meet the overall framerate needed. Right?
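
To put rough numbers on this, the sketch below works through the arithmetic for a 10 ms frame period and 25 ms of processing per frame. The function names are made up for illustration, and the first function optimistically assumes that frames processed concurrently do not slow each other down, which is exactly the assumption in question here.

```cpp
#include <cassert>
#include <cmath>

// Minimum number of frames that must be in flight at once to keep up with the
// camera, ASSUMING concurrent frames do not slow each other down.
int requiredInFlight(double processingMs, double periodMs)
{
    assert(processingMs > 0.0 && periodMs > 0.0);
    return static_cast<int>(std::ceil(processingMs / periodMs));
}

// Frames still waiting after `seconds` when frames are processed strictly one
// at a time (0 if processing keeps up with the camera).
double backlogAfter(double seconds, double processingMs, double periodMs)
{
    assert(seconds >= 0.0 && processingMs > 0.0 && periodMs > 0.0);
    const double arrived   = seconds * 1000.0 / periodMs;
    const double processed = seconds * 1000.0 / processingMs;
    return arrived > processed ? arrived - processed : 0.0;
}
```

With the numbers above, requiredInFlight(25.0, 10.0) gives 3, and after one second of strictly serial processing backlogAfter(1.0, 25.0, 10.0) gives 60 frames (100 arrived, 40 processed).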

This might be the answer to your problem, but I would like Nvidia to chime in here. I don’t necessarily need a single graph to run two nodes in parallel - yes, this is an optimization we would like (per the question above), but there might be a different way to construct the problem to achieve better parallelization?

What’s the correct way to run many graphs at once to approach 100% GPU utilization? Do I make multiple contexts and call all the functions for each context in different pthreads? Or do I make one context and many graphs? Then load each image into the graph and run it from different pthreads? This question is assuming each image coming from the camera can be analyzed independently.

How do I know programmatically (without using the profiler) how many CUDA cores each graph is using? It would be nice to spin off a bunch of graphs and then, when the GPU compute load approaches 100%, simply slow down the frame rate of my camera with a PID loop… thoughts?

Why don’t the VisionWorks examples include a “correct parallelization example”? Am I missing something?

StefanoAldegheri · July 30, 2018, 4:43pm

Hello,
the closest approach I found to get parallel execution is using multiple contexts. Just to recall:
Multiple nodes in a single graph -> the nodes execute serially with respect to each other
Multiple graphs in a single context -> the graphs execute serially with respect to each other
One graph in each of multiple contexts (1 graph per context) -> parallel execution
Multiple graphs in multiple contexts (multiple graphs per context) -> the contexts execute in parallel, but only 1 graph per context runs at a time

So it seems that each context can run only 1 graph at a time. Using multiple contexts works around this, but it introduces unnecessary copy operations to move data from your local memory to VisionWorks-managed memory.

For your second question, I don’t think you can get those numbers. Even if you could, how would you determine how many CUDA cores a graph uses when each node could have a different utilization?

I’d like an official response to these questions, since this behaviour has a lot of influence on my research.

asgikling · July 31, 2018, 6:49pm

Stefano,

Thank you very much for the follow up. This is still confusing though. I’ll have to test your statements above I think and make a post of my findings (if there’s time). I’m not sure your 4 bullets are accurate though based on some more reading I’ve done.

Here’s a few items to consider. All of these need some better explanation by Nvidia - I feel like we’re missing a manual or something…

According to the documentation, using -O3 for NVX_GRAPH_VERIFY_OPTIONS will “activate the asynchronous CUDA execution beyond nodes boundaries.” Have you tried this? This is not the default option.
The Nvidia moderator in this very similar discussion: https://devtalk.nvidia.com/default/topic/1006304/?comment=5154197# seems very sure that calling vxScheduleGraph multiple times on different graphs (from the same or different CPU pthreads, when done in the same context) will cause the graphs to run in parallel on the GPU. We need to test this one out. That forum thread ended without a clear answer…
I read the “Programming Model” section of the VisionWorks 1.6 documentation closely. It contains a section called “Valid Concurrent Calls” and an example called “Shared Graph Input” which spell out the rules surrounding concurrent graphs/contexts. It seems a lot of thought has been put into this by Nvidia. Therefore, I don’t understand why the Programming Model section doesn’t have more detail on the “correct” way to run GPU compute graphs/contexts in parallel when using one or more CPU threads.

Maybe we can get someone in here that can point to evidence or documentation that puts this to rest… for now I’ll try your suggestions and post my results if I can.

Thanks,

~Andy

asgikling · August 6, 2018, 4:01pm

Dear Nvidia Dev Team,

Ok so I’ve tested this more and it makes no sense. Someone explain these results!

With the following code, I can demonstrate with the profiler that we are not getting parallel execution of graphs. I can’t figure it out…

//Taken from Nvidia forum:
//https://devtalk.nvidia.com/default/topic/1029527/-visionworks-parallel-execution-of-nodes-or-graphs
//and further modified

#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <memory>
#include <unistd.h>
#include <vector>
#include <algorithm>
#include <numeric>
#include <future>

#include <NVX/Application.hpp>
#include <NVX/Utility.hpp>
#include <OVX/UtilityOVX.hpp>
#include <OVX/FrameSourceOVX.hpp>
#include <OVX/RenderOVX.hpp>

void run(vx_graph g)
{
    vxScheduleGraph(g);
}
 
int main(int argc, char *argv[])
{
    vx_context context1 = vxCreateContext();
    vx_context context2 = vxCreateContext();
    nvxio::Application &app = nvxio::Application::get();
 
    vxRegisterLogCallback(context1, &ovxio::stdoutLogCallback, vx_false_e);
    vxRegisterLogCallback(context2, &ovxio::stdoutLogCallback, vx_false_e);
 
    std::string sourceUri1 = "big_image1.png";
    std::string sourceUri2 = "big_image2.png";
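    app.setDescription("Example application node independence");
    app.addOption('s', "source", "Source URI", nvxio::OptionHandler::string(&sourceUri1));
    //A char literal holds a single character and option names must be unique,
    //so the second source gets its own short and long name
    app.addOption('t', "source2", "Second source URI", nvxio::OptionHandler::string(&sourceUri2));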
    app.init(argc, argv);
               
    vx_image source1 = ovxio::loadImageFromFile(context1, sourceUri1, VX_DF_IMAGE_RGB);
    vx_image source2 = ovxio::loadImageFromFile(context2, sourceUri2, VX_DF_IMAGE_RGB);
               
    //-------------------------------------------------------------------------
    //Make graph1
    vx_graph graph1 = vxCreateGraph(context1);
               
    vx_image tmp_image1   = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_U8);
    vx_image out_gauss1   = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_VIRT);
    vx_image out_filter1  = vxCreateVirtualImage(graph1, 0, 0, VX_DF_IMAGE_VIRT);
               
    vx_node extract_node1 = vxColorConvertNode(graph1, source1, tmp_image1);
    vx_node gauss_node1   = vxGaussian3x3Node(graph1, tmp_image1, out_gauss1);
    vx_node median_node1  = vxMedian3x3Node(graph1, tmp_image1, out_filter1);
 
    //-------------------------------------------------------------------------
    //Make graph2
    //create a similar graph but with different references
    vx_graph graph2 = vxCreateGraph(context2);
               
    vx_image tmp_image2   = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_U8);
    vx_image out_gauss2   = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_VIRT);
    vx_image out_filter2  = vxCreateVirtualImage(graph2, 0, 0, VX_DF_IMAGE_VIRT);
 
    vx_node extract_node2 = vxColorConvertNode(graph2, source2, tmp_image2);
    vx_node gauss_node2   = vxGaussian3x3Node(graph2, tmp_image2, out_gauss2);
    vx_node box_node2  = vxBox3x3Node(graph2, tmp_image2, out_filter2);
 
    //Make sure we target the gpu
    vxSetNodeTarget(gauss_node1,  NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(median_node1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(extract_node1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(gauss_node2,  NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(box_node2, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(extract_node2, NVX_TARGET_GPU, NULL);
 
    //Need to optimize - doesn't seem to help us achieve parallel graphs though
    //Set before vxVerifyGraph: as the attribute name suggests, the options are read during verification
    const char* option = "-O3";
    NVXIO_SAFE_CALL( vxSetGraphAttribute(graph1, NVX_GRAPH_VERIFY_OPTIONS, option, strlen(option)) );
    NVXIO_SAFE_CALL( vxSetGraphAttribute(graph2, NVX_GRAPH_VERIFY_OPTIONS, option, strlen(option)) );
 
    if (vxVerifyGraph(graph1) != VX_SUCCESS)
    {
        printf("Graph1 verification failed, see [NVX LOG] for details\n");
        fflush(stdout);
        exit(1);
    }
    if (vxVerifyGraph(graph2) != VX_SUCCESS)
    {
        printf("Graph2 verification failed, see [NVX LOG] for details\n");
        fflush(stdout);
        exit(1);
    }
 
    std::future<void> future1[4];
    std::future<void> future2[4];
 
    //Do some work 4 times - does anything run in parallel?
    for(int i = 0; i < 4; i++)
    {
        //For test 0, change the code above to make graph1 and graph2 run in the same context
        //Then call vxScheduleGraph
        //vxScheduleGraph(graph1);
        //vxScheduleGraph(graph2);
 
        //For test 1, simply schedule the work using the async call and see what the profiler shows, again
        //with graph1 and graph2 running in the same context
        //future1[i] = std::async(std::launch::async, run, graph1);
        //future2[i] = std::async(std::launch::async, run, graph2);
 
        //For test 2, use the code above with two graphs (one per context).
        //vxWaitGraph(graph1);
        //vxWaitGraph(graph2);
 
        //For test 3, try launching the work from different threads with two graphs (one per context).
        future1[i] = std::async(std::launch::async, run, graph1);
        future2[i] = std::async(std::launch::async, run, graph2);
 
        //Optionally, test and see how wait changes what the profiler shows us
        //vxWaitGraph(graph1);
        //vxWaitGraph(graph2);
    }
 
    //Sleep a long time so it's clear where our logic ends when looking at the profiler output
    sleep(3);
 
    vxReleaseContext(&context1);
    vxReleaseContext(&context2);
 
    printf("DONE\n");
}

Tests:

Two graphs from one context with one launch thread → no parallel.
Two graphs from one context with two different launch threads → no parallel.
Two graphs from two different contexts (one graph per context) one thread launch → no parallel.
Two graphs from two different contexts (one graph per context) different launch threads → no parallel.

This is a screenshot of the output of test 3:
https://drive.google.com/open?id=1pHQ6QjYoaMK17EqAOov5YJyvEcpXnNR1

Why, for example, wouldn’t the first two calls to “RGB_to_Grey” be shown on the timeline as executing in parallel? Why does the scheduler wait until the first one is done to run the second one in two different kernels - seems like a lot of wasted CUDA cores lol?

The answer to this question has profound implications. I refuse to believe the TX2 is incapable of what I’m after here. Someone explain what’s missing please.

~Andy

Robert_Crovella · August 6, 2018, 9:41pm

I’ve looked at the screenshot of the output of test 3, and at the moment it’s not obvious to me what the concern is.

I’ve not been through the code in great detail, but the compute timeline of that profiler output is solidly packed - it is wall-to-wall kernels. There are no gaps that I noticed.

I’m not sure if you were expecting the RGB-To-Grey kernels to overlap, for example, but it’s not obvious to me why they should, or why you think that would help.

If a CUDA kernel fully occupies the GPU (which would be the norm for any efficient kernel design with a problem of reasonable size, for example an image of 100x100 or larger), there is no reason to expect kernel overlap, and furthermore, overlap of kernels if it were possible, would not necessarily make things run quicker.

The TX2 GPU is not particularly large, so the number of threads/blocks to saturate the GPU would not be particularly large, meaning that it might be relatively easy to saturate the GPU.

Stated another way, if the 2 calls to RGB_to_Grey overlapped (let’s say fully) then my expectation is that each would execute half as fast and therefore take twice as long. There would be no improvement in overall performance.

If you believe the opposite should be true, you’d need to justify why you think so (e.g. these are tiny kernel calls). The opposite claim does not stand by itself without support. If the opposite were simply true, then we could repeat this process ad infinitum - run 4 kernels in parallel, or 8, or 16, or 32. But the reality is that the machine has capacity limits. If a kernel hits a machine capacity limit, you will not typically witness kernel overlap, and even if you did, sharing a limited capacity among two kernels would only slow each one down.

The Jetson TX2 has 2 SMs, each able to hold up to 2048 resident threads, which means it has a max theoretical instantaneous capacity of 4096 threads. Any kernel launch of 4096 total threads or larger should be able to saturate that GPU, preventing any other blocks or kernels from executing, until threads start to retire.

If the kernels depicted in your timeline are each of 4096 threads or larger, then there would be absolutely no reason to expect much overlap, and there would be little expectation of any substantial performance gain, even if overlap were witnessed.
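
As a back-of-the-envelope check, the threshold can be computed like this. The 2048 threads-per-SM figure is the one implied above for the TX2 (4096 threads over 2 SMs). The one-thread-per-pixel mapping is only an assumption for illustration, since the actual VisionWorks kernels may assign several pixels to each thread, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Resident-thread capacity per SM, as quoted in this thread for the TX2.
constexpr std::uint64_t kMaxThreadsPerSM = 2048;

// Number of threads needed to fill every SM at once.
std::uint64_t saturationThreads(unsigned numSMs)
{
    return kMaxThreadsPerSM * numSMs;
}

// Threads launched for an image if each thread handles `pixelsPerThread`
// pixels (ASSUMPTION: a simple per-pixel mapping, rounded up).
std::uint64_t threadsForImage(unsigned width, unsigned height, unsigned pixelsPerThread)
{
    assert(pixelsPerThread > 0);
    const std::uint64_t pixels = static_cast<std::uint64_t>(width) * height;
    return (pixels + pixelsPerThread - 1) / pixelsPerThread;
}

// True if a single kernel over the image would be expected to fill the GPU.
bool likelySaturates(unsigned width, unsigned height, unsigned pixelsPerThread, unsigned numSMs)
{
    return threadsForImage(width, height, pixelsPerThread) >= saturationThreads(numSMs);
}
```

Under these assumptions a 100x100 image already launches 10000 threads against a capacity of 4096, so likelySaturates(100, 100, 1, 2) is true, while a 32x32 image (1024 threads) would leave room for another kernel.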

The comment “seems like a lot of wasted CUDA cores lol?” probably indicates a lack of understanding of the CUDA execution model. Or else you should state which cores you think are wasted, and why.

Here’s a mental model that I sometimes use when I am teaching CUDA:

Suppose you have a wood chipper. I suspect you might know what a wood chipper is - if you for example hypothetically speaking lived in a place like Minnesota, or North Dakota, or Western New York.

I have a pile of brush from an oak tree, and a pile of brush from a maple tree. I am feeding bundles of wood/brush into the wood chipper, with an aim towards keeping the hopper constantly full.

Suppose I feed all the maple brush first, in bundles of the maximum size, then I feed all the oak brush, in bundles of maximum size.

Alternatively, suppose I create bundles of half maple and half oak. Do I get through the overall work any quicker? To a first order approximation, I do not. The wood chipper has a limited instantaneous capacity. As long as I keep the hopper full, it does not really matter how I feed the work.

The GPU, if it is given the oak brush first, to work on, will generally feed oak bundles into the wood chipper until the oak bundles are gone, then it will start picking up maple bundles. But even if it created mixed maple/oak bundles, it would not be any quicker.

And there is no reason to think that simply because I have maple bundles and also oak bundles that I can somehow feed them both into the limited capacity wood chipper, and get the overall job done in the same time that it takes me to do just the maple bundles. That’s not sensible; the machine doesn’t have infinite capacity.
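
The same argument as a toy calculation: model the GPU as a machine with a fixed throughput, then compare running two kernels back to back with running them overlapped, each getting half the machine. This is a deliberately crude model (hypothetical function names, work in arbitrary units, no launch overhead or tail effects), not a description of the real hardware scheduler.

```cpp
#include <algorithm>
#include <cassert>

// Total time when the two jobs run one after the other, each using the whole machine.
double serialMakespan(double work1, double work2, double capacity)
{
    assert(capacity > 0.0);
    return work1 / capacity + work2 / capacity;
}

// Total time when both jobs start together and split the machine evenly;
// once the smaller job finishes, the larger one gets the whole machine.
double sharedMakespan(double work1, double work2, double capacity)
{
    assert(capacity > 0.0);
    const double smaller = std::min(work1, work2);
    const double larger  = std::max(work1, work2);
    const double sharedPhase = smaller / (capacity / 2.0);
    const double soloPhase   = (larger - smaller) / capacity;
    return sharedPhase + soloPhase;
}
```

For example, with work of 8192 and 4096 units on a machine that processes 4096 units per time step, both functions return 3 time steps. In this model the two schedules always tie, because both simplify to (work1 + work2) / capacity.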

asgikling · August 7, 2018, 3:47pm

Txbob,

I (we all) really appreciate you following up on this!

Your answer makes total sense. Also, you’re an astute individual - yes, I was/am completely unaware of the underlying CUDA execution model - my apologies. Coming from an industrial machine vision background (think Cognex cameras), I assumed that I was going to see each of my graphs showing up in the profiler, running in parallel, “one per available GPU core,” or something like that.

Furthermore, I do live in Minnesota lol! The wood chipper is a good analogy.

What’s unfortunate is there’s some misinformation from another forum thread surrounding this topic that I had read before posting here. The confusion of this other thread, coupled with a lack of information in the VisionWorks 1.6 documentation has been a real headache for someone trying to come up to speed.

The folks over in this other forum discussion are wondering about the same exact problem, but the moderator and all the other participants appear to be lost on this topic as well. Nobody has suggested the fact that the GPU might be already maxed out while working on the single graph it was presented and that’s why you don’t ever see two graphs running at the same time in the profiler. See here:

https://devtalk.nvidia.com/default/topic/1006304/jetson-tx1/visionworks-how-can-i-execute-parallel-node-process-in-graph-/post/5139436/#5139436

You might want to set the record straight and post a link to your response above in that misguided discussion. Or I will.

Ok so, I wouldn’t be doing my job if I didn’t follow up with a few new questions. Any more of your sage advice would be much appreciated:

When I make a graph node like “RGB_to_Gray” in the above example, is there any way to programmatically read “how many cores are running that node”? When I run that node, what is my actual GPU utilization at that moment?
Is there any documentation that better explains how CUDA cores will be utilized (ie, how the work of a node is broken up for many CUDA cores and many CUDA threads) for each of the various node types of the OpenVX standard? Or is this just a black box we don’t get to peer into? I’m really surprised that the “Programming Model” section of the VisionWorks 1.6 documentation makes no mention of the concepts in this forum discussion.
Can you suggest a C++ “optimized brush feeding strategy” to maximize GPU utilization? For example, I still don’t know if I should make 1 pthread on my CPU that continuously calls vxScheduleGraph() or if I should actually use many pthreads. Similarly, as described above, how many graphs, contexts and CPU pthreads should my C++ application maintain to ensure the wood chipper is always working at full capacity? A few words on how to avoid memory copies would be nice as well.
Once I have proven through testing that my wood chipper is at capacity, how can I programmatically know this is the case at runtime, so I can tell my forestry team to slow down the truckloads of brush arriving at my wood chipper? I suppose I can use some of the virtual file system calls similar to “$ sudo ~/tegrastats” to see GPU utilization; however, it seems like there would be a part of the CUDA API for this. This is a typical “related rates” calculus problem that everyone trying to build a semi-realtime system with a GPU will face.

In my machine vision application we have way more data (images) to process than the TX2 can keep up with, so I need to be sure I’m maxing the GPU out to find my max sustained image throughput.

Thank you again for your time,

~Andy

Robert_Crovella · August 8, 2018, 3:48am

That’s not my read of it. Right off the bat, we can observe that in the profiler timeline posted here:

https://devtalk.nvidia.com/default/topic/1006304/jetson-tx1/visionworks-how-can-i-execute-parallel-node-process-in-graph-/post/5143397/#5143397

there are gaps where no CUDA kernels are executing. I don’t see that in your profiler timeline from here:

https://drive.google.com/open?id=1pHQ6QjYoaMK17EqAOov5YJyvEcpXnNR1

The processing graph in the other thread looks quite different from yours. I’m not sure why you would equate the two. I don’t think AaastaLLL is lost on this topic. I think (s)he is addressing a different observation (processing gaps in the timeline) even though the OP in that case was asking exactly about kernel overlap. Unfortunately AaastaLLL did not post a profiler timeline.

Anyway, I don’t wish to argue this point. It’s OK if we disagree.

Also, to be clear, I haven’t studied your code, nor have I run any of this. In order to declare “that kernel is saturating the GPU” I would either want to:

study the source code
or
use the profiler to hover my mouse over the kernel launches, look at their resource utilization in the lower right corner, and make a judgement call based on that.

I haven’t done either of those things. But if I witnessed a kernel that was launching 4096 or more threads, I would say “that kernel has a good chance of saturating a TX2 (at least for a certain duration) and during that saturation duration, I would not expect kernel overlap with any other kernel”. That’s a CUDA principle, and not specific to VisionWorks. (Although the 4096 number is specific to the TX2. Different GPUs will have different instantaneous thread capacities.)

In the CUDA ecosystem, a kernel launch with 4096 or fewer threads is a really small kernel launch. But again, I haven’t checked this case.

Yes. It would certainly help if you were knowledgeable about CUDA programming, the CUDA execution model, and also basic profiling and optimization. First, most CUDA kernels (the well-written ones, anyway) expand to fill the machine. That is the basic CUDA execution model. There is no assignment of threads to cores, and a CUDA “core” is really quite different than a CPU “core”, anyway. Beginners should essentially ignore the concept of CUDA cores. The machine consists of SMs. Your objective as a programmer is to write kernels that are big enough to saturate the SMs on the GPU. To a first order approximation, 2048*#ofSMs is a good starting point, and more is quite OK. Such kernels are desirable, optimal, and will saturate the machine, largely preventing kernel overlap (ignoring the tail effect), or that aspect of concurrency. Second, the profiler can give you various hints when this is happening. You can hover over a kernel launch to get launch characteristics. The profiler can give you “occupancy” measurements. And there is other data. There is documentation for the profiler and various good presentations available on using the profiler as well as analysis-driven optimization. But, in case it’s lost in this wall of text, the GPU block scheduler seeks to fully utilize the entire GPU, with just a single kernel launch, if possible.

No, and this is the wrong way to think about things. CUDA cores are not what you think they are, and work scheduling on a GPU is quite different than work scheduling on a CPU. I hope some of my previous comments have pointed this out. You have some learning to do.

This question is probably mostly beyond me. I’m not enough of an expert in VisionWorks to sort that out, and I won’t become one in the next day or so. I’m mostly addressing this topic from the standpoint of what I know about how a CUDA GPU behaves, not what I know about VisionWorks (which isn’t much). However, to beat a dead horse, the profiler trace you showed already shows back-to-back kernel launches in the compute timeline (the compute timeline is solid blue) with essentially no gaps. I would say you are already where you want to be. Whatever you are doing is working, from what I can see. Remember the wood chipper? You’re keeping the hopper “full”. Good job.

My first question would be “why do you want to do that?”

If you were a forestry person, clearing out a section, do you ever really want to tell your team to “slow down” ? I wouldn’t.

Let’s extend the wood chipper model. The wood chipper has a hopper (FIFO queue) which has a certain width and height, defined by the dimensions of your ~~GPU~~ chipper blade, but it has (to a first order approximation) an infinite depth. Good programs queue up as much GPU work as possible, as early as possible and for as long as possible. Throw as much as possible in the FIFO queue, as quickly as possible. That is how you get stuff through the wood chipper as quickly as possible.

If you really want to “monitor” the queue processing rate, you could easily insert events into the queue (or even CUDA callbacks), and observe when those events trigger, to get a rate monitor. If you want to get fancier, yes, you can embed profiling directly into your runtime (although I don’t happen to know whether this is supported on Jetson processors or not):

https://docs.nvidia.com/cuda/cupti/index.html

I wouldn’t recommend CUPTI for beginners, though, and I really doubt what you are asking for needs that level of precision. It’s not really clear that what you are asking for is needed at all, but I’m not going to argue that.
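
For the simpler rate-monitor idea, the host-side bookkeeping might look like the sketch below. `RateMonitor` is a hypothetical helper, not part of CUDA or VisionWorks. The timestamps would come from wherever you observe completion (for example right after vxWaitGraph returns, or when a CUDA event recorded behind the work is found to have completed). Times are integer milliseconds to keep the window arithmetic exact.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class RateMonitor
{
public:
    explicit RateMonitor(std::int64_t windowMs) : windowMs_(windowMs)
    {
        assert(windowMs > 0);
    }

    // Call once per completed unit of work, with non-decreasing timestamps.
    void recordCompletion(std::int64_t nowMs)
    {
        stamps_.push_back(nowMs);
        // Keep only completions inside the half-open window (nowMs - windowMs, nowMs].
        while (!stamps_.empty() && stamps_.front() <= nowMs - windowMs_)
            stamps_.pop_front();
    }

    // Completions per second over the window ending at the latest completion.
    // Reads low until a full window of data has been recorded.
    double completionsPerSecond() const
    {
        return static_cast<double>(stamps_.size()) * 1000.0 / static_cast<double>(windowMs_);
    }

private:
    std::int64_t windowMs_;
    std::deque<std::int64_t> stamps_;
};
```

A graph that completes every 25 ms, fed into a monitor with a 1000 ms window, settles at 40 completions per second; comparing that figure with the rate at which work is submitted shows whether the queue is growing.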

There’s quite a bit of CUDA documentation:

https://docs.nvidia.com/cuda/index.html

and having this under your belt will inform you a lot when it comes to using a system like VisionWorks that is built on top of CUDA. Furthermore, to learn about CUDA profiling and optimization (not Jetson or VisionWorks specific) I would recommend presentations like these:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

But you’ll need basic CUDA concepts under your belt first:

http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf

I’m not going to be able to do much with VisionWorks questions. I think you may get better results asking those questions over on the other forum you already pointed to:

https://devtalk.nvidia.com/default/board/139/jetson-embedded-systems/

asgikling · August 13, 2018, 6:12pm

Txbob,

Excellent - all this information is enormously helpful. I very much appreciate your time and effort in guiding me (and others) on this topic.

I guess in short, if you begin to really care about how well your VisionWorks code is performing, you better know the underlying CUDA compute model. I’ve begun reading about this now and everything you’ve said here makes way more sense.

The multi-camera vision system connected to my TX2 via USB3 is more than capable of overwhelming the GPU with even the simplest machine vision algorithms running on each frame. So, I’d like to automatically control the camera’s frame rate based on overall system utilization - that’s why my forestry team will be told to slow down. I love the wood chipper analogy though!
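
A minimal sketch of that idea, using only the proportional term of a PID loop. Everything here is an assumption for illustration: the function name, the gain, and the notion of a single utilization number between 0 and 1, which would have to come from something like tegrastats or a completion-rate measurement.

```cpp
#include <algorithm>
#include <cassert>

// One step of a proportional controller: raise the frame rate when measured
// utilization is below the target, lower it when above, and clamp the result
// to the range the camera supports.
double nextFrameRate(double currentFps, double utilization, double targetUtilization,
                     double gain, double minFps, double maxFps)
{
    assert(minFps > 0.0 && minFps <= maxFps);
    const double error    = targetUtilization - utilization;
    const double proposed = currentFps * (1.0 + gain * error);
    return std::min(maxFps, std::max(minFps, proposed));
}
```

For instance, at 100 fps with the GPU fully busy (utilization 1.0), a target of 0.9 and a gain of 0.5, the next step asks the camera for roughly 95 fps; with an idle GPU the result is clamped to the camera maximum.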

Thanks,

~Andy