I am new to graphics programming, so maybe I have not understood the texture concept correctly.
I am implementing a sparse-matrix multiplication. This involves random memory access, i.e. accesses that are not coalesced. The CUDA programming guide suggests using the texture unit for this. So I replaced:
It works perfectly in device emulation mode, but not on the card. I get wrong results. Is there something wrong with this way of using textures? The speedup I see is approx. 2.
GFlops are not an appropriate unit in this context. The limiting factor is the memory access speed.
Some numbers (all approx):
Aligned copy from global memory to global memory: 52 GB/s
Unaligned copy from global memory to global memory: 7 GB/s
Copy with unaligned read from texture, aligned write to global memory: 18 GB/s
(which could be 52 GB/s write and 11 GB/s read)
With these numbers, you end up somewhere between less than 1 GFlops if all operations are unaligned and more than 3 GFlops if all operations are aligned.
Unaligned reads are hellishly expensive, and unaligned writes are even worse :(
Thanks for the numbers, it’s quite useful to have something for reference. But still, for the dot product, which is the most memory bound of them all (more memory bound than SpMV and SpMM), I got 20 GFlops in CUBLAS, so that’s 80 GB/s of bandwidth. I was thinking that if I can get the memory alignment right, I should get more than 3 GFlops … I hope …
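The conversion from 20 GFlops to 80 GB/s can be spelled out as follows. This is a rough sketch that assumes single-precision data and counts only the two input reads per element pair. The names `kDotBytesPerFlop`, `bandwidth_for` and `gflops_ceiling` are invented for illustration.

```cpp
// Memory traffic per floating-point operation for a single-precision dot
// product (ASSUMED model): each element pair costs one multiply and one add
// (2 flops) and reads two 4-byte floats (8 bytes), so 4 bytes per flop.
constexpr double kDotBytesPerFlop = 8.0 / 2.0;

// Bandwidth in GB/s needed to sustain a given GFlops rate.
constexpr double bandwidth_for(double gflops, double bytes_per_flop) {
    return gflops * bytes_per_flop;
}

// GFlops ceiling of a purely memory-bound kernel at a given bandwidth in GB/s.
constexpr double gflops_ceiling(double bandwidth_gbs, double bytes_per_flop) {
    return bandwidth_gbs / bytes_per_flop;
}
```

Here `bandwidth_for(20.0, kDotBytesPerFlop)` is 80 GB/s, matching the figure above. The same helper applies to SpMV once a bytes-per-flop figure is chosen for the storage format, which also has to account for the index arrays.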
Can anyone help me implement SpMV in CSR format where the data is stored in a separate file (data.txt)? This is the first time I have done graphics programming, so I would appreciate it if anyone who helps could be as clear as possible.