Hello,
Apologies for a double post, but when I wrote this:
http://forums.nvidia.com/index.php?showtopic=62349
I was under the impression that I was facing a mostly Windows-specific problem, but I now suspect it's more general than that, and perhaps my whole approach is wrong.
The problem: I’m implementing an algorithm that fits almost perfectly into the functionality of CUBLAS, except for one operation: an element-wise hyperbolic tangent (tanh) over a single vector. Right now I’m downloading the vector off the device, computing tanh() on the CPU, and uploading it again, and profiling shows that this round trip kills any speed advantage CUDA gives me.
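For context, this is roughly the round trip I'm doing now (simplified; tanh_on_host, d_x, and n are just placeholder names of mine):

    #include <math.h>    /* tanhf */
    #include <stdlib.h>  /* malloc, free */
    #include <cublas.h>  /* cublasGetVector, cublasSetVector */

    /* Apply tanh to a CUBLAS device vector by round-tripping through
       the host -- the step that profiling says is killing me. */
    void tanh_on_host(float *d_x, int n)
    {
        float *h_x = (float *)malloc(n * sizeof(float));
        cublasGetVector(n, sizeof(float), d_x, 1, h_x, 1);  /* device -> host */
        for (int i = 0; i < n; ++i)
            h_x[i] = tanhf(h_x[i]);                         /* tanh on the CPU */
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  /* host -> device */
        free(h_x);
    }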
It seems like The Thing To Do is to write a trivial CUDA kernel that performs tanhf() on a vector. What is the best way to integrate this kernel into my program? Should I write a completely separate program that compiles to a .dll and is called by my CUBLAS application, or can the kernel be conveniently integrated into the same project? I took a peek at the CUBLAS source, and it looks like it makes decisions about whether or not to put things in texture memory, which makes me wonder whether that complicates my task. Is there a stub program out there that demonstrates a custom kernel operating on a CUBLAS result? That would answer a lot of the low-level questions implied by the ones above (e.g., do you still call CUT_DEVICE_INIT() if you’ve already called cublasInit()?).
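To be concrete, the kind of kernel I have in mind is something like this (untested, just a sketch of my intent; tanh_kernel is my own placeholder name, and d_x would be the same device pointer I got back from cublasAlloc()):

    /* Element-wise tanh over a device vector, one thread per element. */
    __global__ void tanh_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = tanhf(x[i]);
    }

    /* ...and then, right after a CUBLAS call, launch it on the same
       device pointer, with no host round trip: */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    tanh_kernel<<<blocks, threads>>>(d_x, n);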
I know it’s possible to invoke a kernel from C++ (CUBLAS does it :) ), but I just lack a starting point for how to go about it.
Any advice appreciated!