There are two major issues that I’m hitting when using CUBLAS at the moment.
a) I want to apply a sigmoid function to the results of a CUBLAS sgemm. Currently I do the sigmoid in my own kernel afterwards, which costs an unnecessary extra read and write of the whole result matrix in global memory. Is there any way to avoid that extra read/write?
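To make (a) concrete, the pattern looks roughly like the sketch below. It is a simplified illustration, not my actual code: the names `gemm_then_sigmoid` and `sigmoid_inplace`, the sizes, and the fill values are all placeholders, and error checking is left out.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Pass 2: element-wise sigmoid, in place. Each element of C is read from
// and written back to global memory a second time here.
__global__ void sigmoid_inplace(float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = 1.0f / (1.0f + expf(-c[i]));
}

// Computes sigmoid(A * B) for column-major A (m x k) and B (k x n).
// Hypothetical helper, for illustration only.
std::vector<float> gemm_then_sigmoid(const std::vector<float>& hA,
                                     const std::vector<float>& hB,
                                     int m, int n, int k) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, hA.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Pass 1: C = A * B, written to global memory by CUBLAS.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    // Pass 2: separate launch that re-reads and re-writes all of C.
    const int total = m * n, threads = 256;
    sigmoid_inplace<<<(total + threads - 1) / threads, threads>>>(dC, total);

    std::vector<float> hC(total);
    cudaMemcpy(hC.data(), dC, sizeof(float) * total, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return hC;
}

int main() {
    const int m = 4, n = 3, k = 2;  // placeholder sizes
    std::vector<float> A(m * k, 1.0f), B(k * n, 0.5f);
    std::vector<float> C = gemm_then_sigmoid(A, B, m, n, k);
    printf("C[0] = %f\n", C[0]);
    return 0;
}
```

The second launch is the part I would like to get rid of: the sigmoid itself is trivial, but it forces another full pass over C.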
b) Elsewhere, I’m updating various individual columns of a large matrix with one axpy call per column (very slow) and then doing a very fast matrix multiply. Ideally, I’d like to redirect the multiply so that, when it first reads the matrix in, it pulls those columns from wherever the updated data lives instead of needing the separate axpy pass. What’s the best approach to take here?
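For (b), the current pattern is roughly the following. Again this is only a simplified sketch under assumptions: `update_columns_then_gemm`, the index list `cols`, and the tiny sizes are made up for illustration, all matrices are assumed column-major (so each column is contiguous), and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Adds alpha * X[:, u] to column cols[u] of W (m x k), one cublasSaxpy per
// updated column, then computes C = W * B with B (k x n).
// Hypothetical helper, for illustration only.
std::vector<float> update_columns_then_gemm(const std::vector<float>& hW,
                                            const std::vector<float>& hX,
                                            const std::vector<int>& cols,
                                            const std::vector<float>& hB,
                                            int m, int n, int k, float alpha) {
    float *dW, *dX, *dB, *dC;
    cudaMalloc(&dW, sizeof(float) * hW.size());
    cudaMalloc(&dX, sizeof(float) * hX.size());
    cudaMalloc(&dB, sizeof(float) * hB.size());
    cudaMalloc(&dC, sizeof(float) * m * n);
    cudaMemcpy(dW, hW.data(), sizeof(float) * hW.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX.data(), sizeof(float) * hX.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float) * hB.size(), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One small axpy per updated column. Each call only touches m floats,
    // so the per-call overhead dominates when many columns are updated.
    for (size_t u = 0; u < cols.size(); ++u) {
        cublasSaxpy(handle, m, &alpha,
                    dX + u * m, 1,                          // source column u of X
                    dW + static_cast<size_t>(cols[u]) * m, 1);  // target column of W
    }

    // The multiply itself is fast: C = W * B.
    const float one = 1.0f, zero = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, dW, m, dB, k, &zero, dC, m);

    std::vector<float> hC(static_cast<size_t>(m) * n);
    cudaMemcpy(hC.data(), dC, sizeof(float) * hC.size(), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dW); cudaFree(dX); cudaFree(dB); cudaFree(dC);
    return hC;
}

int main() {
    const int m = 2, n = 1, k = 3;          // placeholder sizes
    std::vector<float> W(m * k, 0.0f);      // matrix being updated
    std::vector<float> X = {1.0f, 2.0f};    // one update column
    std::vector<int> cols = {1};            // update column 1 of W
    std::vector<float> B(k * n, 1.0f);
    std::vector<float> C = update_columns_then_gemm(W, X, cols, B, m, n, k, 2.0f);
    printf("C = [%f, %f]\n", C[0], C[1]);
    return 0;
}
```

The loop of small axpy calls is where the time goes; the sgemm at the end is not the problem.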
The latency from these extra passes is just killing me with CUBLAS at the moment, and I’m about to abandon it and write my own kernels, where I can restrict the number of memory accesses.
Your expertise and advice would be very much appreciated. Thanks in advance!!