I have run several experiments on this but can't get good results.
Suppose A, B, and C are arrays of the same length, and I want to compute:
for (int i = 0; i < length; i++)
    C[i] = A[i] - B[i];
I used the cublasSaxpy function, but the time I get is a little slower than on the CPU.
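For reference, here is a rough sketch of how the subtraction can be expressed with saxpy. Since saxpy computes y = alpha*x + y, A is first copied into C and then B is added with alpha = -1. This assumes the handle-based cuBLAS API from cublas_v2.h and float arrays that are already in device memory; the name subtract_with_saxpy is only illustrative and this is not necessarily the exact code I timed.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes d_C = d_A - d_B for n floats that already live in device memory.
// Assumption: handle was created with cublasCreate and all pointers are valid.
// Returns true if both cuBLAS calls report success.
bool subtract_with_saxpy(cublasHandle_t handle,
                         const float* d_A, const float* d_B,
                         float* d_C, int n)
{
    // Step 1: C = A
    if (cublasScopy(handle, n, d_A, 1, d_C, 1) != CUBLAS_STATUS_SUCCESS)
        return false;

    // Step 2: C = (-1) * B + C, which gives A - B
    const float alpha = -1.0f;
    return cublasSaxpy(handle, n, &alpha, d_B, 1, d_C, 1) == CUBLAS_STATUS_SUCCESS;
}
```

Note that this makes two passes over the data (the copy and the saxpy) instead of one.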
As another method, I bound A and B to textures and subtracted them in a kernel,
but that is also slower than the CPU.
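To make the kernel approach concrete, here is a minimal sketch of the same subtraction written as a plain kernel that reads from global memory (no textures). It assumes float data and length > 0; the names subtract_kernel and subtract_on_gpu and the block size of 256 are placeholders for illustration, not tuned values or my exact code.

```cuda
#include <cuda_runtime.h>

// One thread per element: C[i] = A[i] - B[i].
__global__ void subtract_kernel(const float* A, const float* B, float* C, int length)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)  // guard the last, partially filled block
        C[i] = A[i] - B[i];
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and copies the result back. Returns true if every step succeeded.
bool subtract_on_gpu(const float* h_A, const float* h_B, float* h_C, int length)
{
    size_t bytes = (size_t)length * sizeof(float);
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;

    bool ok = cudaMalloc(&d_A, bytes) == cudaSuccess
           && cudaMalloc(&d_B, bytes) == cudaSuccess
           && cudaMalloc(&d_C, bytes) == cudaSuccess
           && cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;  // assumed block size, not tuned
        const int blocks = (length + threads - 1) / threads;
        subtract_kernel<<<blocks, threads>>>(d_A, d_B, d_C, length);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return ok;
}
```

The wrapper includes the host-to-device and device-to-host copies, so timing a call to subtract_on_gpu measures the transfers as well as the kernel itself.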
Is there an efficient approach to get the result, I mean one that is much faster than the CPU?