Best way to invoke custom kernel among CUBLAS?


Apologies for a double post, but when I wrote this:

I was under the impression that what I was facing was a mostly Windows-specific problem, but I suspect it’s more general than that now, and perhaps my whole approach is wrong.

The problem: I’m implementing an algorithm that fits almost perfectly into the functionality of CUBLAS, except for one operation. I have to do a piecewise hyperbolic tangent operation on a single vector. Right now I’m downloading the vector off the device, computing tanh() on the CPU, and uploading it again. After profiling I’ve noticed that this operation kills any speed advantage that CUDA gives me.

It seems like The Thing To Do is to write a trivial CUDA kernel that performs tanhf() on a vector. What is the best way to go about integrating this kernel into my program? Should I be writing a completely separate program that compiles to a .dll and is called by my CUBLAS application, or can the kernel be conveniently integrated into the same project? I took a peek at the CUBLAS source, and it looks like it’s making decisions about whether to put things in texture memory or not, and this makes me wonder if that feature complicates my task at all. Is there a stub program out there that demonstrates a custom kernel operating on a CUBLAS result? That would certainly answer a lot of the low-level questions implied by the ones stated above (e.g., do you call CUT_DEVICE_INIT() if you’ve already called cublasInit()?).

I know it’s possible to invoke a kernel from C++ (CUBLAS does it :) ), but I just lack a starting point for how to go about this.

Any advice appreciated!

I know this is almost 10 years later. I am trying to do the same thing. I am in the exact same place as you describe in your post. Did you ever figure it out? I don’t want to use cuDNN or other libraries. Any help would be appreciated.

It’s not difficult to mix cublas calls and kernel calls in the same file.

A kernel to do tanh elementwise on a vector is trivial.
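For anyone landing here later, here is a minimal sketch of what that can look like: one .cu file that runs a cuBLAS matrix-vector product and then applies an elementwise tanh kernel to the result while it is still on the device. Some assumptions on my part: it uses the handle-based cuBLAS API (cublas_v2.h) rather than the legacy cublasInit() mentioned in the original question, the names tanh_kernel and run_demo are just illustrative, and the sizes and data are made up. No separate device-init call is made; the CUDA runtime initializes the device on first use.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Elementwise tanh, in place. One thread per element; the bounds check
// covers sizes that are not a multiple of the block size.
__global__ void tanh_kernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tanhf(x[i]);
}

// Illustrative helper (not part of any library): computes y = tanh(A * x)
// on the GPU and checks it against a CPU reference. Returns 0 on success.
int run_demo()
{
    const int m = 4, n = 3;  // A is m x n, stored column-major as cuBLAS expects
    std::vector<float> hA(m * n), hx(n), hy(m, 0.0f);
    for (int i = 0; i < m * n; ++i) hA[i] = 0.1f * (i - 5);
    for (int j = 0; j < n; ++j) hx[j] = 1.0f + j;

    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dA, m * n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&dx, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&dy, m * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(dA, hA.data(), m * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    // y = 1 * A * x + 0 * y. The result stays in device memory (dy).
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasSgemv(handle, CUBLAS_OP_N, m, n,
                                    &alpha, dA, m, dx, 1, &beta, dy, 1);
    if (st != CUBLAS_STATUS_SUCCESS) return 1;

    // Custom kernel on the cuBLAS result. Both use the default stream here,
    // so the kernel runs after the gemv without an explicit sync.
    const int threads = 256;
    const int blocks = (m + threads - 1) / threads;
    tanh_kernel<<<blocks, threads>>>(dy, m);
    if (cudaGetLastError() != cudaSuccess) return 1;

    // Blocking copy; also waits for the kernel above to finish.
    cudaMemcpy(hy.data(), dy, m * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare with a plain CPU computation of tanh(A * x).
    int bad = 0;
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) acc += hA[i + j * m] * hx[j];
        if (std::fabs(std::tanh(acc) - hy[i]) > 1e-5f) ++bad;
    }

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return bad == 0 ? 0 : 1;
}

int main()
{
    int rc = run_demo();
    std::printf(rc == 0 ? "tanh(A*x) matches CPU reference\n"
                        : "mismatch or CUDA/cuBLAS error\n");
    return rc;
}
```

Something like `nvcc demo.cu -lcublas -o demo` should build it (the file name is arbitrary). The point is that the vector never leaves the GPU between the cuBLAS call and the kernel, which removes the download/tanh-on-CPU/upload round trip described in the original question.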

Thanks for replying.

I am new to cuda programming. I have only used cublas. Until you mentioned “kernel” I had no idea where to look. I am new. I have found some simple examples, like, which even gives me instructions to set it up with Visual Studio. It is a bit old, but I believe I can figure it out for VS 2015.

Thank you again.

a good starting point for learning about these things is all the documentation that nvidia maintains at

click on the CUDA docs

Hey DonSolo. You’re internet famous!

I hear NVIDIA is into AI a little bit, but they don’t provide a simple tanh, sigmoid, or piecewise linear function to go along with their GEMMs.


It’s the most obvious thing to anyone who uses CUDA with custom neural networks. One day, when NVIDIA really starts to get into AI, they might include it. I have been here 10 years and still write my own activation kernels because… no one at NVIDIA uses their own tools; otherwise someone would have added it long ago.