CUBLAS and slow global memory accesses

Hi guys,

There are two major issues that I’m hitting when using CUBLAS at the moment.

a) I want to apply a sigmoid function to the result of a CUBLAS sgemm. At the moment I do the sigmoid in my own kernel afterwards, which costs an extra read and write of the whole result in global memory. Is there any way to avoid that extra pass?

b) Elsewhere I’m updating various individual columns of a large matrix with one axpy call per column (very slow), and then doing a very fast matrix multiply. Ideally, I would like to redirect where the multiply pulls those columns from when it first reads the matrix in. What’s the best approach to take here?

Dreadful latency is just killing me with CUBLAS at the moment and I’m about to abandon it and write my own kernels where I can restrict the number of memory accesses.

Your expertise and advice would be very much appreciated. Thanks in advance!!

Maybe I can help with a)

Don’t you have to use arrays that have already been copied onto device memory to use CUBLAS?

In that case you shouldn’t need to copy the result back to the host and then onto the device again just to run another kernel.

You should be able to pass the same pointer to your own kernel that you gave to the CUBLAS function.
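
Here is a rough sketch of what that could look like, in case it helps. It assumes the newer handle-based CUBLAS API (cublas_v2.h), column-major storage, and no transposes; the names sigmoid_inplace and sgemm_then_sigmoid are made up for the example. It does not remove the second pass over the result, it only shows that the result never has to leave the device.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Applies the logistic sigmoid in place to n floats in device memory.
// Grid-stride loop, so any grid size covers the whole buffer.
__global__ void sigmoid_inplace(float* c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        c[i] = 1.0f / (1.0f + expf(-c[i]));
    }
}

// Hypothetical helper: C = sigmoid(A * B).
// A is m x k, B is k x n, C is m x n, all column-major device pointers.
cublasStatus_t sgemm_then_sigmoid(cublasHandle_t handle, int m, int n, int k,
                                  const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                    &alpha, dA, m, dB, k, &beta, dC, m);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    size_t total = (size_t)m * (size_t)n;
    if (total == 0) return CUBLAS_STATUS_SUCCESS;

    // Launch on the same stream CUBLAS uses so the kernel runs after the SGEMM.
    cudaStream_t stream;
    st = cublasGetStream(handle, &stream);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    const int threads = 256;
    size_t wanted = (total + threads - 1) / threads;
    int blocks = wanted > 65535 ? 65535 : (int)wanted;

    // Same device pointer that was handed to cublasSgemm, no host copy involved.
    sigmoid_inplace<<<blocks, threads, 0, stream>>>(dC, total);
    return cudaGetLastError() == cudaSuccess ? CUBLAS_STATUS_SUCCESS
                                             : CUBLAS_STATUS_EXECUTION_FAILED;
}
```

Call it with the same dA, dB and dC you already allocated with cudaMalloc for the SGEMM.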

For a), are you really losing a lot of performance to the extra read and write of global memory? For relatively large matrices the SGEMM call should be compute-bound (limited by floating-point throughput), so the extra reads and writes should be almost free in comparison.
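
To put rough numbers on that, here is a back-of-the-envelope estimate written as host code. It is a simplified model, not a measurement: it assumes the SGEMM runs at peak floating-point rate and the sigmoid pass at peak memory bandwidth, and it ignores kernel launch overhead. The function name and the hardware figures in the note below are hypothetical.

```cuda
// Estimated time of the extra elementwise pass as a fraction of SGEMM time.
// Per output element of C (m x n, inner dimension k):
//   SGEMM does about 2*k floating-point operations,
//   the extra pass reads 4 bytes and writes 4 bytes.
// m and n cancel out, so only k and the hardware ratio matter.
double extra_pass_fraction(double k, double peak_flops, double bandwidth_bytes) {
    double gemm_time = 2.0 * k / peak_flops;   // seconds per output element
    double pass_time = 8.0 / bandwidth_bytes;  // seconds per output element
    return pass_time / gemm_time;
}
```

With a made-up device at 1e12 flop/s and 1e11 bytes/s, an inner dimension of k = 1024 gives about 0.04, so the extra pass costs roughly 4% of the SGEMM. With k = 16 the same model gives 2.5, meaning the pass would take longer than the multiply itself, which may be what is being seen if the matrices are small.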

For b), why not update all of the columns simultaneously?
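
One possible way to do that is a single kernel that applies all the column updates in one launch instead of one axpy call per column. This is only a sketch under some assumptions: column-major storage, the update vectors packed side by side in a second matrix X, distinct column indices (otherwise two blocks would race on the same column), and no more than 65535 columns per launch. All the names here are made up for the example.

```cuda
#include <cuda_runtime.h>

// For each c in [0, ncols): A[:, cols[c]] += alphas[c] * X[:, c]
// A has leading dimension lda, X has leading dimension ldx, both column-major.
// blockIdx.y selects the update, threads along x walk the rows so that
// neighbouring threads touch neighbouring addresses.
__global__ void axpy_columns(float* A, int lda, int m,
                             const float* X, int ldx,
                             const int* cols, const float* alphas, int ncols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int c = blockIdx.y;
    if (row < m && c < ncols) {
        A[(size_t)cols[c] * lda + row] += alphas[c] * X[(size_t)c * ldx + row];
    }
}

// Hypothetical host wrapper; every pointer is a device pointer.
// Assumes the entries of dCols are distinct and ncols <= 65535.
cudaError_t update_columns(float* dA, int lda, int m,
                           const float* dX, int ldx,
                           const int* dCols, const float* dAlphas, int ncols,
                           cudaStream_t stream) {
    if (m <= 0 || ncols <= 0) return cudaSuccess;
    const int threads = 256;
    dim3 block(threads);
    dim3 grid((m + threads - 1) / threads, ncols);
    axpy_columns<<<grid, block, 0, stream>>>(dA, lda, m, dX, ldx,
                                             dCols, dAlphas, ncols);
    return cudaGetLastError();
}
```

The memory traffic is the same as the separate axpy calls, but the launch overhead is paid once rather than once per column.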