It appears that the CUBLAS (CUDA Basic Linear Algebra Subprograms) library implements its routines with CUDA parallel techniques hidden under a layer of abstraction.
Is it possible to access the optimized subroutines directly, without that layer of abstraction, in order to call them from a CUDA or OpenCL kernel?

How is CUBLAS expected to operate in an OpenCL program?

Please refer to simpleCUBLAS.c in the CUDA SDK.

Thank you.

All CUBLAS library calls are callable only from CPU (host) code, not from code executing on the GPU. So you can’t call a CUBLAS function from an OpenCL kernel, just as you can’t call one from a CUDA kernel.