CUDA with OpenCV Aruco Module

I am working on a project, intended to run eventually on a Jetson TX1, that uses CUDA to accelerate the marker detection and pose estimation features of the OpenCV Aruco module (which deals with fiducial markers). I have some questions about how to approach parallelizing code with CUDA, especially since most of the codebase is currently written in Python (which probably implies heavy use of PyCUDA).

I noticed that when I build OpenCV after installing the CUDA toolkit, the build detects CUDA 8.0 and compiles a number of CUDA-related files. I assume this means that CUDA optimizations are available, or are already being used under the hood by the OpenCV features. My first question: if CUDA features or optimizations are already part of my OpenCV build, how do I make use of them?
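For reference, here is a minimal sketch of how the build could be inspected from Python. It assumes the `cv2` bindings are importable and uses `cv2.getBuildInformation()`. The helper names `cuda_lines` and `report_opencv_cuda` are made up for illustration, and whether `cv2.cuda` is exposed at all depends on the OpenCV version and build options, so the code checks for it rather than assuming it.

```python
def cuda_lines(build_info):
    """Return the lines of an OpenCV build summary that mention CUDA."""
    return [line.strip() for line in build_info.splitlines()
            if "cuda" in line.lower()]


def report_opencv_cuda():
    """Print what the local OpenCV build reports about CUDA (hypothetical helper)."""
    try:
        import cv2
    except ImportError:
        print("cv2 is not importable in this environment")
        return None

    # The build summary is plain text; its exact layout varies between versions.
    for line in cuda_lines(cv2.getBuildInformation()):
        print(line)

    # Assumption: the Python bindings may or may not expose the cuda namespace.
    cuda_ns = getattr(cv2, "cuda", None)
    if cuda_ns is None or not hasattr(cuda_ns, "getCudaEnabledDeviceCount"):
        print("this build's Python bindings do not expose cv2.cuda")
        return None

    count = cuda_ns.getCudaEnabledDeviceCount()
    print("CUDA-enabled devices visible to OpenCV:", count)
    return count


if __name__ == "__main__":
    report_opencv_cuda()
```

This only tells me what the build contains. It does not tell me whether the Aruco functions themselves take a CUDA path, which is the part I am unsure about.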

Furthermore, if I end up having to write a non-trivial amount of CUDA kernel code (compiled to CUBIN) to parallelize certain parts of the Aruco algorithms, such as thresholding and filtering, would the best approach indeed be to put those kernels in CUBIN files and use PyCUDA to call them from the parts of the algorithm that don’t require as much computation?
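To make the question concrete, this is roughly the structure I have in mind, as a sketch only. It uses a plain global threshold as a stand-in (not necessarily what the Aruco module does internally), compiles the kernel at runtime with PyCUDA's `SourceModule` instead of loading a prebuilt CUBIN, and assumes PyCUDA, NumPy, and a CUDA-capable device are available. The names `threshold_u8`, `threshold_cpu`, and `threshold_gpu` are hypothetical.

```python
import numpy as np

# CUDA C source for an element-wise threshold over a flattened 8-bit image.
KERNEL_SOURCE = r"""
__global__ void threshold_u8(const unsigned char *src, unsigned char *dst,
                             int n, unsigned char thresh)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = (src[i] > thresh) ? 255 : 0;
    }
}
"""


def threshold_cpu(gray, thresh):
    """NumPy reference: pixels above thresh become 255, the rest 0."""
    return np.where(gray > thresh, 255, 0).astype(np.uint8)


def threshold_gpu(gray, thresh):
    """Same operation on the GPU through PyCUDA (requires a CUDA device)."""
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    kernel = SourceModule(KERNEL_SOURCE).get_function("threshold_u8")

    src = np.ascontiguousarray(gray, dtype=np.uint8)
    dst = np.empty_like(src)
    n = src.size

    block = 256                       # threads per block
    grid = (n + block - 1) // block   # enough blocks to cover every pixel

    # drv.In / drv.Out copy the arrays to and from the device around the call.
    kernel(drv.In(src), drv.Out(dst), np.int32(n), np.uint8(thresh),
           block=(block, 1, 1), grid=(grid, 1))
    return dst
```

The Python side would keep the control flow and call something like `threshold_gpu` only for the heavy per-pixel steps. The host-to-device copies on every call are an obvious cost of this layout.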

I also noticed that VisionWorks has features that could be relevant to my project. Should I focus on that library instead?

Thanks!