How to subtract two matrices efficiently


I have run several experiments on this but can’t get good results.

Suppose I have float A[256], B[256], C[256], with length = 256:

for (int i = 0; i < length; i++)
    C[i] = A[i] - B[i];

I used the cublasSaxpy function, but the time I get is a little slower than on the CPU.
As another method, I put A and B into textures and subtracted them in a kernel,
but that is also slower than the CPU.
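
For reference, the cublasSaxpy route can be sketched roughly as follows. This is only a sketch under assumptions: it uses the handle-based cuBLAS API from cublas_v2.h (which may not be the version used in the experiments above), the helper name subtractWithCublas is hypothetical, and error checking is left out. It computes C = A - B by first loading A into the output buffer and then adding -1 * B to it.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: returns a - b computed with cublasSaxpy.
// Error checking is omitted to keep the sketch short.
std::vector<float> subtractWithCublas(const std::vector<float>& a,
                                      const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);

    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Start with C = A by copying A straight into the output buffer.
    cudaMemcpy(dC, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // saxpy computes y = alpha * x + y, so alpha = -1 gives C = C - B = A - B.
    const float alpha = -1.0f;
    cublasSaxpy(handle, n, &alpha, dB, 1, dC, 1);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

With only 256 elements, the fixed cost of the host-device copies and the library call is likely to outweigh the arithmetic itself, which may explain part of the timing.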

Is there an efficient approach to get the result, I mean one that is much faster than the CPU?

Don’t use a for loop; make each thread read one pixel. Check the examples, there’s probably something very similar there. The matrix multiplication example in the manual is not bad, but it is more complex than the subtraction you’re doing, so it might be confusing.
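
Something like the following is what "one thread per element" could look like. It is a minimal sketch, not taken from the SDK examples: the names subKernel and subtractOnGpu are made up for illustration, the block size of 256 is just a common starting point, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread computes exactly one output element: c[i] = a[i] - b[i].
__global__ void subKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // threads past the end of the arrays do nothing
        c[i] = a[i] - b[i];
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and copies the result back. Error checking is omitted.
std::vector<float> subtractOnGpu(const std::vector<float>& a,
                                 const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);

    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so every element gets a thread.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    subKernel<<<blocks, threads>>>(dA, dB, dC, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling subtractOnGpu(a, b) with two equally sized vectors returns their element-wise difference. For an array as small as 256 floats, the copies and the launch will probably dominate the run time, so the GPU is unlikely to win unless the data is much larger or already lives on the device.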

I didn’t use a for loop; that’s the CPU version.

I took the BLAS method from the example, but it is slower than the CPU.

I also used tex2D(texture, x, y) to make each thread read one element, but that is much slower.

OK, sorry :) Didn’t mean to imply you didn’t know what you were doing.

What are your block sizes? How big is the data? Are you doing coalesced global reads?
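
To answer those questions it may help to time the kernel on its own, separately from the memory copies. Below is a rough sketch using CUDA events; the helper name timeSubKernelMs is hypothetical, error checking is omitted, and the inputs are just zero-filled because only the timing matters here. Consecutive threads access consecutive floats, which is the access pattern that coalesces.

```cuda
#include <cuda_runtime.h>

// Thread i touches element i, so neighbouring threads read and write
// neighbouring addresses (a coalesced access pattern).
__global__ void subKernelTimed(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] - b[i];
}

// Hypothetical helper: returns the kernel-only time in milliseconds for
// n elements and the given block size. Host-device copies are not included.
float timeSubKernelMs(int n, int threadsPerBlock)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);

    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup cost is not measured.
    subKernelTimed<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    cudaEventRecord(start);
    subKernelTimed<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ms;
}
```

Comparing, say, timeSubKernelMs(256, 256) with timeSubKernelMs(1 << 20, 256) should show whether the kernel itself is slow or whether the fixed launch and transfer overhead is what hurts at small sizes.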

You could use tex1Dfetch instead of tex2D to improve performance.
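
A possible shape for the tex1Dfetch variant is sketched below. Note the assumptions: it uses the texture object API (cudaCreateTextureObject), which needs a newer CUDA release than the texture references this thread is probably talking about, the helper names are invented for illustration, and error checking is omitted. Whether it beats plain coalesced global reads would have to be measured.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one element of A and B through 1D linear textures.
__global__ void subTexKernel(cudaTextureObject_t texA, cudaTextureObject_t texB,
                             float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = tex1Dfetch<float>(texA, i) - tex1Dfetch<float>(texB, i);
}

// Hypothetical helper: wraps a linear device buffer of floats in a texture object.
static cudaTextureObject_t makeLinearTexture(float* devPtr, size_t bytes)
{
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devPtr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    cudaTextureDesc tex;
    memset(&tex, 0, sizeof(tex));
    tex.readMode = cudaReadModeElementType;  // return raw floats, no conversion

    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, nullptr);
    return obj;
}

// Hypothetical host helper: returns a - b using texture fetches for the reads.
std::vector<float> subtractWithTextures(const std::vector<float>& a,
                                        const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);

    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    cudaTextureObject_t texA = makeLinearTexture(dA, bytes);
    cudaTextureObject_t texB = makeLinearTexture(dB, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    subTexKernel<<<blocks, threads>>>(texA, texB, dC, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(texA);
    cudaDestroyTextureObject(texB);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

The design point is that tex1Dfetch reads from ordinary linear device memory with an integer index, so no 2D array setup or coordinate arithmetic is needed, unlike tex2D.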