VisionWorks: how can I execute nodes in parallel in a graph?

I tested VisionWorks on a TX1.

In the VisionWorks reference there is a statement about parallel node processing:

“OpenVX does not mandate that they are run simultaneously or in parallel, but it could be implemented this way by the OpenVX vendor.”

So I assumed that VisionWorks (the vendor's OpenVX implementation) might implement some form of parallel node execution.

But my test showed otherwise: when I ran a graph consisting of three nodes with no data dependencies between them, the NVIDIA Visual Profiler showed no parallel execution.

In my project, this problem is very serious.

So here are my questions.

  1. VisionWorks does not perform better than the OpenCV GPU module when I test a single function (CUDA kernel).

Why is VisionWorks' performance the same as the OpenCV GPU module function's?

  2. In VisionWorks, how can I execute nodes concurrently?
    If VisionWorks doesn't support concurrent node execution, how can I implement it?

When I tested a graph consisting of three nodes with no data dependencies between them,
it executed serially, and there was no performance difference compared to the OpenCV GPU module.


Hi,

Sorry for the late reply.

1).
Which function do you use? This should be a function-dependent problem.

2).
A graph is executed sequentially, since it is a pipeline-based architecture.
Two concurrent nodes should be implemented as two pipelines. (They can, of course, share the same input.)
For example, if you want to convert the same image into two different formats:

#include <NVX/nvx.h>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>   // NVXIO_SAFE_CALL

int main(int argc, char** argv)
{
    int width = 640;
    int height = 480;

    vx_context context = vxCreateContext();
    vx_image frame = vxCreateImage(context, width, height, VX_DF_IMAGE_RGB);
    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    // One color-convert node per graph, so the two graphs are independent.
    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame, gray2);

    for (int i = 0; i < 1000; i++)
    {
        NVXIO_SAFE_CALL( vxProcessGraph(graph1) );
        NVXIO_SAFE_CALL( vxProcessGraph(graph2) );
    }

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

Graph1 and graph2 will be executed concurrently.

Thank you for your kind response,

but when I tested your code example, I found that it did not run in parallel.

So I tested many cases:

two heterogeneous device nodes (CPU, GPU) in one graph; using “nvxCreateStreamGraph” when creating the graph (I'm not sure whether this is related to parallel processing);

your suggestion (one GPU node in each of two graphs);

and the asynchronous graph execution model, but none of them worked!

I know VisionWorks is helpful for optimizing vision processing, but it did not give me parallel processing!

How can I check that two CUDA streams run concurrently in VisionWorks?
(Currently all CUDA streams and CPU threads run sequentially…)

Below is my test code (modified to test the asynchronous execution model; it was actually my last test case):

#include <opencv2/opencv.hpp>
#include <NVX/nvx.h>
#include <NVX/nvx_opencv_interop.hpp>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>

int main(int argc, char** argv)
{
    cv::Mat mat = cv::imread("512kb.jpg");
    int width = mat.cols;
    int height = mat.rows;
    vx_context context = vxCreateContext();
    vx_image frame = nvx_cv::createVXImageFromCVMat(context, mat);

    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame, gray2);
    vxSetNodeTarget(cvtNode1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(cvtNode2, NVX_TARGET_GPU, NULL);

    for (int i = 0; i < 1000; i++)
    {
        // Asynchronous launch. Note that the OpenVX spec does not allow
        // re-scheduling a graph that is still scheduled (before its
        // vxWaitGraph returns), so waiting only every 50th iteration may
        // make the intermediate vxScheduleGraph calls return an error.
        NVXIO_SAFE_CALL( vxScheduleGraph(graph1) );
        NVXIO_SAFE_CALL( vxScheduleGraph(graph2) );
        if (i % 50 == 0)
        {
            NVXIO_SAFE_CALL( vxWaitGraph(graph2) );
            NVXIO_SAFE_CALL( vxWaitGraph(graph1) );
        }
    }
    NVXIO_SAFE_CALL( vxWaitGraph(graph1) );
    NVXIO_SAFE_CALL( vxWaitGraph(graph2) );

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

Hi,

How do you check concurrency?

I just confirmed the sample posted in #2.
The two pipelines are launched into different CUDA streams and are executed concurrently.

Could you share more information about your profiling steps?

I have some screenshots of my test.

When I test the code with your suggestion, the result looks like this.

If you look at the bottom portion of the screenshot, you will see that the two streams do not run concurrently.

My expectation is concurrent execution of the streams, like below.

As I already mentioned, none of my tests ran concurrently (two graphs with CPU and GPU nodes, one graph with CPU and GPU nodes,
two graphs with two GPU nodes, and even running the graphs with vxScheduleGraph, which runs a graph asynchronously).

In my view, cudaStreamSynchronize always runs when I execute the graphs, so if I could remove cudaStreamSynchronize, they would
run concurrently. If that is not possible, VisionWorks may not provide concurrent node execution.

My lab mates and I now think we may need to write VisionWorks user kernel code in CUDA to enable concurrent node execution.

What do you think? Should we write user kernel code, or is there another way to run nodes concurrently?

Hi,

Thanks for your feedback.

But I am sorry that I can't reproduce your situation.
Both vxProcessGraph and vxScheduleGraph run concurrently on my side, just like the photo you expected.

Could you try to enable maximum clocks first?

sudo ./jetson_clocks.sh

It looks like your application spends a lot of time in CPU computation.
Could you attach your complete source code so we can debug it?

By the way, did you call a synchronize function? I saw one in the profiling figure.

Sorry for the late response, and thank you for the answer, AastaLLL.
I didn't know there was a script for CPU optimization,
but that wasn't the reason computation takes so long on the CPU.
Actually, I found that VisionWorks doesn't use all the CPU cores; it uses only one core,
so I think this is the main reason the CPU time is very long.

When I test my code, I change it frequently for other test cases, but basically I wrote this code following your suggestion, and I didn't call any synchronize function shown in the profiling figure; it was called automatically by VisionWorks…

#include <opencv2/opencv.hpp>
#include <NVX/nvx.h>
#include <NVX/nvx_opencv_interop.hpp>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>

int main(int argc, char** argv)
{
    cv::Mat mat = cv::imread("512kb.jpg");
    int width = mat.cols;
    int height = mat.rows;
    vx_context context = vxCreateContext();
    vx_image frame1 = nvx_cv::createVXImageFromCVMat(context, mat);

    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    // Make two independent nodes, one per graph.
    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame1, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame1, gray2);

    // When I want to test the CPU version, I change this to NVX_TARGET_CPU.
    vxSetNodeTarget(cvtNode1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(cvtNode2, NVX_TARGET_GPU, NULL);

    for (int i = 0; i < 100; i++)
    {
        // To switch the graphs to asynchronous execution, I change
        // vxProcessGraph to vxScheduleGraph here.
        NVXIO_SAFE_CALL( vxProcessGraph(graph1) );
        NVXIO_SAFE_CALL( vxProcessGraph(graph2) );
    }

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame1);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

This screenshot shows VisionWorks running on only one core when I change the node target to CPU.

The other picture shows OpenCV4Tegra's multicore CPU usage…

Hi,

Thanks for your feedback.
But there is something I want to confirm first:

all the concurrency I mentioned before is GPU concurrency.
Do you also care about CPU parallelism?

Although we do optimize OpenVX for the ARM architecture, we care more about the GPU implementation.

Thank you for your cooperation again, AastaLLL.

Yes, we also consider CPU optimization. Our lab is a distributed computing lab, so we are also concerned with improving operation efficiency across heterogeneous devices.

Also, on the TX1 board the GPU's performance is actually worse than a desktop GPU's, so in my opinion optimizing the CPU is very important: because the GPU's performance is lower, the CPU's performance is relatively more significant.

Anyway, on the GPU stream concurrency issue, I found an interesting article about it:
https://www.quora.com/Does-using-CUDA-Stream-API-improve-GPU-Occupancy

But I think this optimization is impossible in VisionWorks and OpenCV, because I assume the threads in each CUDA stream
are set automatically by VisionWorks and OpenCV. Is that right? Or is it also possible in VisionWorks and OpenCV?

Hi,

This is a little confusing.

Surely, the TX1 won't achieve the computational power of a desktop GPU.
But compared to the A57, the GPU should have better performance on image processing tasks.

So, could you describe your observation more precisely?
What makes you feel the GPU's performance is bad and the CPU's is relatively good?

What I mean is that the performance difference between the CPU and GPU on the TX1 board is smaller than the performance difference between the CPU and GPU on a desktop. I also know that the CPU and GPU can work in parallel,
so if the CPU can work alongside the GPU on the TX1 board, it will be more helpful than in the desktop case.
But unfortunately, in my experiments this was not possible, and our TX1 board does not run CUDA kernels concurrently.

Hi,

We should figure out why we have reached different conclusions on the TX1.

  1. CUDA kernels can run concurrently.
  2. The CPU and GPU can also run concurrently (via asynchronous launch).

Could you share the main purpose of your experiment?
Do you want to compare VisionWorks and OpenCV?

If you have read all the way to here and are still very confused (as I was), you should have a look at this thread, which discusses many of the same ideas:

https://devtalk.nvidia.com/default/topic/1029527/-visionworks-parallel-execution-of-nodes-or-graphs-

Understanding when/if your scheduled work is saturating the GPU can explain why you are not seeing VisionWorks kernels run in parallel in the System Profiler.

~Andy