VisionWorks: how can I execute nodes in parallel in a graph?

I tested VisionWorks on a TX1.

In the VisionWorks reference there is a statement about parallel node processing:

“OpenVX does not mandate that they are run simultaneously or in parallel, but it could be implemented this way by the OpenVX vendor.”

So I assumed that VisionWorks (the vendor's OpenVX implementation) might implement some form of parallel node execution.

But my test showed otherwise: when I ran a graph consisting of three nodes with no data dependencies between them, the NVIDIA Visual Profiler showed no parallel execution.

In my project, this problem is very serious.

So here are my questions.

  1. VisionWorks does not perform better than the OpenCV GPU module when I test a single function (CUDA kernel).

Why is VisionWorks' performance the same as the OpenCV GPU module function's?

  2. In VisionWorks, how can I execute nodes concurrently?
    If VisionWorks doesn't support concurrent node execution, how can I implement it?

When I tested a graph consisting of three nodes with no data dependencies between them,
it executed serially, and there was no performance difference compared to the OpenCV GPU module.


Hi,

Sorry for the late reply.

1).
Which function do you use? This should be a function-dependent problem.

2).
A graph is executed sequentially, since it is a pipeline-based architecture.
Two concurrent nodes should be implemented as two pipelines. (They can, of course, share the same input.)
For example, if you want to convert the same image into two different formats:

#include <NVX/nvx.h>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>   // NVXIO_SAFE_CALL

int main(int argc, char** argv)
{
    int width = 640;
    int height = 480;

    vx_context context = vxCreateContext();
    vx_image frame = vxCreateImage(context, width, height, VX_DF_IMAGE_RGB);
    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    // One color-convert node per graph, so the two graphs are independent.
    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame, gray2);

    for (int i = 0; i < 1000; i++)
    {
        NVXIO_SAFE_CALL( vxProcessGraph(graph1) );
        NVXIO_SAFE_CALL( vxProcessGraph(graph2) );
    }

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

Graph1 and graph2 will be executed concurrently.

Thank you for your kind response,

but when I tested your code example, I found that it did not run in parallel.

So I tested many cases:

two heterogeneous device nodes (CPU, GPU) in one graph; using “nvxCreateStreamGraph” when creating the graph (I'm not sure whether this is related to parallel processing);

your suggestion (one GPU node in each of two graphs);

and the asynchronous graph execution model, but none of them worked!

I know VisionWorks is helpful for optimizing vision processing, but it did not give me parallel processing!

How can I check that two CUDA streams run concurrently in VisionWorks?
(Currently all CUDA streams and CPU threads run sequentially…)

Below is my test code (modified to test the asynchronous execution model; it was actually my last test case):

#include <opencv2/opencv.hpp>
#include <NVX/nvx.h>
#include <NVX/nvx_opencv_interop.hpp>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>

int main(int argc, char** argv)
{
    cv::Mat mat = cv::imread("512kb.jpg");
    int width = mat.cols;
    int height = mat.rows;
    vx_context context = vxCreateContext();
    vx_image frame = nvx_cv::createVXImageFromCVMat(context, mat);

    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame, gray2);
    vxSetNodeTarget(cvtNode1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(cvtNode2, NVX_TARGET_GPU, NULL);

    for (int i = 0; i < 1000; i++)
    {
        // Asynchronous launch. Note that the OpenVX spec does not allow
        // re-scheduling a graph that is still scheduled (before its
        // vxWaitGraph returns), so waiting only every 50th iteration may
        // make the intermediate vxScheduleGraph calls return an error.
        NVXIO_SAFE_CALL( vxScheduleGraph(graph1) );
        NVXIO_SAFE_CALL( vxScheduleGraph(graph2) );
        if (i % 50 == 0)
        {
            NVXIO_SAFE_CALL( vxWaitGraph(graph2) );
            NVXIO_SAFE_CALL( vxWaitGraph(graph1) );
        }
    }
    NVXIO_SAFE_CALL( vxWaitGraph(graph1) );
    NVXIO_SAFE_CALL( vxWaitGraph(graph2) );

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

Hi,

How do you check concurrency?

I just confirmed the sample posted in #2.
The two pipelines are launched into different CUDA streams and are executed concurrently.

Could you share more information about your profiling steps?

I have some screenshots of my test.

When I test the code with your suggestion, the result looks like this.

If you look at the bottom portion of the screenshot, you will see that the two streams do not run concurrently.

My expectation is concurrent execution of the streams, like below.

As I already mentioned, none of my tests ran concurrently (two graphs with CPU and GPU nodes, one graph with CPU and GPU nodes,
two graphs with two GPU nodes, and even running the graphs with vxScheduleGraph, which runs a graph asynchronously).

In my view, cudaStreamSynchronize always runs when I execute the graphs, so if I could remove cudaStreamSynchronize, they would
run concurrently. If that is not possible, VisionWorks may not provide concurrent node execution.

My lab mates and I now think we may need to write VisionWorks user kernel code in CUDA to enable concurrent node execution.

What do you think? Should we write user kernel code, or is there another way to run nodes concurrently?

Hi,

Thanks for your feedback.

But I am sorry that I can't reproduce your situation.
Both vxProcessGraph and vxScheduleGraph run concurrently on my side, just like the photo you expected.

Could you try to enable maximum clocks first?

sudo ./jetson_clocks.sh

It looks like your application spends a lot of time in CPU computation.
Could you attach your complete source code so we can debug it?

By the way, did you call a synchronize function? I saw one in the profiling figure.

Sorry for the late response, and thank you for the answer, AastaLLL.
I didn't know there was a script for CPU optimization,
but that wasn't the reason computation takes so long on the CPU.
Actually, I found that VisionWorks doesn't use all the CPU cores; it uses only one core,
so I think this is the main reason the CPU time is very long.

When I test my code, I change it frequently for other test cases, but basically I wrote this code following your suggestion, and I didn't call any synchronize function shown in the profiling figure; it was called automatically by VisionWorks…

#include <opencv2/opencv.hpp>
#include <NVX/nvx.h>
#include <NVX/nvx_opencv_interop.hpp>
#include <NVXIO/Application.hpp>
#include <NVXIO/Utility.hpp>

int main(int argc, char** argv)
{
    cv::Mat mat = cv::imread("512kb.jpg");
    int width = mat.cols;
    int height = mat.rows;
    vx_context context = vxCreateContext();
    vx_image frame1 = nvx_cv::createVXImageFromCVMat(context, mat);

    vx_image gray1 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image gray2 = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);

    // Make two independent nodes, one per graph.
    vx_graph graph1 = vxCreateGraph(context);
    vx_graph graph2 = vxCreateGraph(context);
    vx_node cvtNode1 = vxColorConvertNode(graph1, frame1, gray1);
    vx_node cvtNode2 = vxColorConvertNode(graph2, frame1, gray2);

    // When I want to test the CPU version, I change this to NVX_TARGET_CPU.
    vxSetNodeTarget(cvtNode1, NVX_TARGET_GPU, NULL);
    vxSetNodeTarget(cvtNode2, NVX_TARGET_GPU, NULL);

    for (int i = 0; i < 100; i++)
    {
        // To switch the graphs to asynchronous execution, I change
        // vxProcessGraph to vxScheduleGraph here.
        NVXIO_SAFE_CALL( vxProcessGraph(graph1) );
        NVXIO_SAFE_CALL( vxProcessGraph(graph2) );
    }

    vxReleaseNode(&cvtNode1);
    vxReleaseNode(&cvtNode2);
    vxReleaseGraph(&graph1);
    vxReleaseGraph(&graph2);
    vxReleaseImage(&frame1);
    vxReleaseImage(&gray1);
    vxReleaseImage(&gray2);
    vxReleaseContext(&context);
    return nvxio::Application::APP_EXIT_CODE_SUCCESS;
}

This screenshot shows VisionWorks running on only one core when I change the node target to CPU.

The other picture shows OpenCV4Tegra's multicore CPU usage…

Hi,

Thanks for your feedback.
But there is something I want to confirm first:

all the concurrency I mentioned before is GPU concurrency.
Do you also care about CPU parallelism?

Although we do optimize OpenVX for the ARM architecture, we care more about the GPU implementation.

Thank you for your cooperation again, AastaLLL.

Yes, we also consider CPU optimization. Our lab is a distributed computing lab, so we are also concerned with improving operation efficiency across heterogeneous devices.

Also, on the TX1 board the GPU's performance is actually worse than a desktop GPU's, so in my opinion optimizing the CPU is very important: because the GPU's performance is lower, the CPU's performance is relatively more significant.

Anyway, on the GPU stream concurrency issue, I found an interesting article about it:
https://www.quora.com/Does-using-CUDA-Stream-API-improve-GPU-Occupancy

But I think this optimization is impossible in VisionWorks and OpenCV, because I assume the threads in each CUDA stream
are set automatically by VisionWorks and OpenCV. Is that right? Or is it also possible in VisionWorks and OpenCV?

Hi,

This is a little confusing.

Surely, the TX1 won't achieve the computational power of a desktop GPU.
But compared to the A57, the GPU should have better performance on image processing tasks.

So, could you describe your observation more precisely?
What makes you feel the GPU's performance is bad and the CPU's is relatively good?

What I mean is that the performance difference between the CPU and GPU on the TX1 board is smaller than the performance difference between the CPU and GPU on a desktop. I also know that the CPU and GPU can work in parallel,
so if the CPU can work alongside the GPU on the TX1 board, it will be more helpful than in the desktop case.
But unfortunately, in my experiments this was not possible, and our TX1 board does not run CUDA kernels concurrently.

Hi,

We should figure out why we have reached different conclusions on the TX1.

  1. CUDA kernels can run concurrently.
  2. The CPU and GPU can also run concurrently (via asynchronous launch).

Could you share the main purpose of your experiment?
Do you want to compare VisionWorks and OpenCV?

If you have read all the way to here and are still very confused (as I was), you should have a look at this thread, which discusses many of the same ideas:

https://devtalk.nvidia.com/default/topic/1029527/-visionworks-parallel-execution-of-nodes-or-graphs-

Understanding when/if your scheduled work is saturating the GPU can explain why you are not seeing VisionWorks kernels run in parallel in the System Profiler.

~Andy