Element-wise Operations: Looking for an optimized kernel, e.g., the source for cublasSaxpy

Is there any optimized code available that performs element-wise operations on vectors/matrices?
I’m interested in putting together such code for several C operators (such as +, -, *, /).

I know it’s one of the simplest things you can perform on a GPU, but I don’t want to reinvent the wheel if someone has already created a solid implementation.
I’m looking for something like the code for cublasSaxpy() in the CUBLAS library (which computes y = alpha*x + y). A fast kernel that performs X = X + Y would be great, where X and Y are matrices or vectors.
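
To make the request concrete, here is a minimal sketch of the kind of kernel I mean. It is not the CUBLAS source, just a plain grid-stride loop. The names `elementwise_inplace`, `Add`, `Sub`, `Mul`, `Div`, and `run_add_check` are my own placeholders, and it assumes X and Y are contiguous single-precision device arrays of the same length, so a matrix is simply treated as a flat vector of rows*cols elements.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder functors, one per C operator (hypothetical names).
struct Add { __device__ float operator()(float a, float b) const { return a + b; } };
struct Sub { __device__ float operator()(float a, float b) const { return a - b; } };
struct Mul { __device__ float operator()(float a, float b) const { return a * b; } };
struct Div { __device__ float operator()(float a, float b) const { return a / b; } };

// Computes x[i] = op(x[i], y[i]) for all i < n using a grid-stride loop,
// so any grid size covers the whole array.
template <typename Op>
__global__ void elementwise_inplace(float* x, const float* y, size_t n, Op op)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = op(x[i], y[i]);
}

// Returns false from the enclosing function on any CUDA error.
// (A sketch: device buffers are not freed on the error path.)
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return false;                                             \
        }                                                             \
    } while (0)

// Runs X = X + Y on n elements (X = 1, Y = 2) and checks every result is 3.
bool run_add_check(size_t n)
{
    if (n == 0) return true;  // nothing to launch
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f);
    const size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    size_t needed = (n + threads - 1) / threads;
    // Cap the grid; the grid-stride loop handles the remainder.
    const int blocks = (int)(needed < 1024 ? needed : 1024);

    elementwise_inplace<<<blocks, threads>>>(d_x, d_y, n, Add());
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_x.data(), d_x, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));

    for (size_t i = 0; i < n; ++i)
        if (h_x[i] != 3.0f) return false;
    return true;
}

int main()
{
    bool ok = run_add_check((size_t)1 << 20);
    printf("%s\n", ok ? "OK" : "MISMATCH");
    return ok ? 0 : 1;
}
```

Swapping `Add()` for `Sub()`, `Mul()`, or `Div()` in the launch gives the other operators. Since this kind of operation is memory-bound, I would guess there is not much left to tune beyond coalesced access and a sensible launch configuration, but that is exactly what I would like to have confirmed by an existing, well-tested implementation.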