I tried the cublasSaxpy function, but the time I get is a little slower than the CPU.
As another method, I put A and B into textures and then subtracted them in the kernel,
but that was also slower than the CPU.
Is there an efficient approach to get the result, I mean one that is much faster than the CPU?
Don’t use a for loop; make each thread read one pixel and compute one output element. Check the examples, there’s probably something very similar there. The matrix multiplication example in the manual is not bad, but it is more complex than the subtraction you’re doing, so it might be confusing.
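
To make the "one thread per pixel" idea concrete, here is a minimal sketch, assuming A and B are flat arrays of n floats with n > 0. The names subtractKernel and subtractOnGpu are made up for illustration, and the block size of 256 is just a common starting point, not a tuned value.

```cuda
#include <cuda_runtime.h>

// Each thread handles exactly one element: c[i] = a[i] - b[i].
// There is no loop in the kernel; the grid covers the whole array.
__global__ void subtractKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // guard the last, partially filled block
        c[i] = a[i] - b[i];
    }
}

// Hypothetical host helper (not from the original post): copies the inputs
// to the device, launches one thread per element, and copies the result back.
// Assumes n > 0.
cudaError_t subtractOnGpu(const float* hostA, const float* hostB, float* hostC, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *devA = nullptr, *devB = nullptr, *devC = nullptr;

    cudaError_t err = cudaMalloc(&devA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&devB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&devC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threadsPerBlock = 256;  // assumed starting point, not tuned
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        subtractKernel<<<blocks, threadsPerBlock>>>(devA, devB, devC, n);
        err = cudaGetLastError();         // catches launch configuration errors
    }

    // This copy runs after the kernel finishes, since both use the default stream.
    if (err == cudaSuccess) err = cudaMemcpy(hostC, devC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return err;
}
```

Calling subtractOnGpu(a, b, c, n) from host code fills c with a[i] - b[i]. When timing this, it may be worth measuring the kernel separately from the cudaMemcpy calls: for a single subtraction the transfers can easily take longer than the arithmetic, so the total may still look close to the CPU time unless the data stays on the GPU for further processing.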