It appears that CUBLAS, the CUDA implementation of the Basic Linear Algebra Subprograms (BLAS), uses CUDA parallel techniques that are hidden beneath a layer of abstraction.
Is it possible to access the optimized subroutines directly, without that layer of abstraction, so that they can be called from inside a CUDA or OpenCL kernel?
How is CUBLAS expected to operate in an OpenCL program?
For reference, please see simpleCUBLAS.c in the CUDA SDK.
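To make the question concrete: as far as I can tell, that sample has the host call the library's SGEMM routine, which computes C = alpha*A*B + beta*C on column-major matrices. Below is a minimal plain C sketch of that operation for square matrices. The name reference_sgemm is my own placeholder, and this is only the mathematical definition, not the library's optimized implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference SGEMM for square n x n matrices stored in
   column-major order: C = alpha * A * B + beta * C.
   reference_sgemm is a hypothetical name used for illustration only. */
void reference_sgemm(int n, float alpha, const float *A,
                     const float *B, float beta, float *C)
{
    assert(n > 0 && A != NULL && B != NULL && C != NULL);

    for (int j = 0; j < n; ++j) {          /* column of C */
        for (int i = 0; i < n; ++i) {      /* row of C */
            float dot = 0.0f;
            for (int k = 0; k < n; ++k) {
                /* element (i,k) of A times element (k,j) of B */
                dot += A[(size_t)k * n + i] * B[(size_t)j * n + k];
            }
            C[(size_t)j * n + i] = alpha * dot + beta * C[(size_t)j * n + i];
        }
    }
}
```

It is the optimized GPU equivalent of a routine like this that I would like to invoke from device code rather than from the host.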
Thank you.