Are there any device functions for matrix operations?

I have an image, and I need to process each pixel by applying a transform or even a pseudo-inverse.
I think these operations are just what a graphics card does, but there seem to be no built-in functions for working with matrices.
Is there a library that provides the structures and operations to do this?
The libraries I found, such as cuBLAS, perform large matrix operations on the GPU when called from the host, but not small matrix operations on the device.

ArrayFire provides a big set of operations for matrices. Enjoy!

CUBLAS can operate on matrices of any size, from very small to as large as will fit on a given GPU. Efficiency can suffer for small matrices due to lack of parallelism, but it may still be advantageous to keep data resident on the GPU and do all the processing there, avoiding copies between host and device. CUBLAS also has limited support for batched operations (such as batched GEMM) to better support operations on small matrices.

On our registered developer website, we have posted code for batched matrix inverses and solvers (with partial pivoting) on small matrices that can fit into shared memory, which you could use as a starting point for your own batched operations. The code is made available under a BSD license.
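The posted code is not reproduced here. As a rough illustration of what a small-matrix inverse with partial pivoting can look like as a device function, here is a minimal sketch. It is not the code from the developer website: it assumes one thread inverts one matrix held in that thread's local storage (rather than one matrix per block in shared memory), and the names `HD`, `invert_small`, and `invert_batch` are hypothetical. The singularity threshold is an arbitrary assumption as well.

```cpp
#include <math.h>
#include <stddef.h>

// Compiles with nvcc as device code, or with a plain C++ compiler for testing
// (in which case the execution-space qualifiers expand to nothing).
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper, not part of any library: inverts a row-major N x N
// matrix in place using Gauss-Jordan elimination with partial pivoting.
// Returns false if a pivot is (near) zero, i.e. the matrix looks singular.
template <int N>
HD bool invert_small(float* a) {
    float inv[N * N];
    for (int i = 0; i < N * N; ++i) inv[i] = 0.0f;
    for (int i = 0; i < N; ++i) inv[i * N + i] = 1.0f;  // start from identity

    for (int col = 0; col < N; ++col) {
        // Partial pivoting: choose the row with the largest |entry| in this column.
        int piv = col;
        for (int r = col + 1; r < N; ++r)
            if (fabsf(a[r * N + col]) > fabsf(a[piv * N + col])) piv = r;
        if (fabsf(a[piv * N + col]) < 1e-12f) return false;  // assumed threshold

        // Swap the pivot row into place in both matrices.
        if (piv != col) {
            for (int c = 0; c < N; ++c) {
                float t = a[col * N + c];
                a[col * N + c] = a[piv * N + c];
                a[piv * N + c] = t;
                t = inv[col * N + c];
                inv[col * N + c] = inv[piv * N + c];
                inv[piv * N + c] = t;
            }
        }

        // Scale the pivot row so the pivot becomes 1.
        float d = 1.0f / a[col * N + col];
        for (int c = 0; c < N; ++c) {
            a[col * N + c] *= d;
            inv[col * N + c] *= d;
        }

        // Eliminate this column from every other row.
        for (int r = 0; r < N; ++r) {
            if (r == col) continue;
            float f = a[r * N + col];
            for (int c = 0; c < N; ++c) {
                a[r * N + c] -= f * a[col * N + c];
                inv[r * N + c] -= f * inv[col * N + c];
            }
        }
    }

    for (int i = 0; i < N * N; ++i) a[i] = inv[i];  // write the inverse back
    return true;
}

#ifdef __CUDACC__
// One thread per matrix: 'mats' holds 'count' contiguous N x N matrices,
// and ok[i] is set to 1 if matrix i was inverted, 0 if it looked singular.
template <int N>
__global__ void invert_batch(float* mats, int count, int* ok) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) ok[i] = invert_small<N>(mats + (size_t)i * N * N) ? 1 : 0;
}
#endif
```

With this layout, a per-pixel kernel could call `invert_small<3>` directly on a matrix it has assembled for its own pixel, without going back to the host. Whether that beats a batched library routine depends on matrix size and memory layout, so it is worth measuring both.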