OpenVX (VisionWorks) nodes are executed sequentially


I am developing a stereo vision application using the OpenVX framework. It will be deployed on a Jetson TX1 embedded platform, but currently I am testing it on an Ubuntu 16.04 machine with a Quadro K420 GPU and an Intel i5 processor.

It is a graph-based pipeline that contains some OpenVX nodes, such as vxRemapNode and vxMeanStdDevNode, and a custom node that I have implemented using CUDA. It takes two images, from the left and right cameras, and produces a disparity map. Both vxRemapNode and vxMeanStdDevNode are applied to each of the two images, and the left and right branches do not depend on each other, so I expected to see some concurrency here. However, the two images are processed one after the other. The next thing I don't understand is why those two nodes are executed on the GPU (according to the profiling results).
I also tried using the medianBlur node, and it showed the same execution behavior.

To see what happens, I profiled the application with the NVIDIA Visual Profiler. You can see the profiling results here: