Using texture to speed up sparse matrix mult?

AndreasBuhr · April 30, 2007, 10:57am

Hello all,

I can’t get the texture unit working as expected.

I am new to graphics programming, so maybe I did not understand the texture concept correctly.

I am implementing a sparse-matrix multiplication. This involves random memory access, i.e. accesses that are not coalesced. The CUDA programming guide suggests using the texture unit for this. So I replaced:

(source_vector is not sparse)

source_vector[index]

with

cudaBindTexture( tex_source, source_vector, channelDescm_source, size, 0 );

// some code using:

texfetch(tex_source, index)

cudaUnbindTexture( tex_source );

It works perfectly in device emulation mode, but on the card I get wrong results. Is there something wrong with this way of using textures? The speedup I see is approximately 2x.

Thanks a lot in advance,
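For readers following along, the irregular access pattern can be seen in a plain CPU reference. This is only a sketch: it assumes the matrix is stored in CSR (compressed sparse row) form, which the post does not state, and the struct and field names are hypothetical. The point is the indexed read of source_vector, whose address depends on the column index of each nonzero.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR container; names are illustrative, not from the post.
struct CsrMatrix {
    std::size_t rows;                  // number of matrix rows
    std::vector<std::size_t> row_ptr;  // rows + 1 entries: start of each row in col_idx/values
    std::vector<std::size_t> col_idx;  // column index of each stored nonzero
    std::vector<float> values;         // value of each stored nonzero
};

// Computes y = A * source_vector for a CSR matrix A on the CPU.
std::vector<float> spmv_csr(const CsrMatrix& a, const std::vector<float>& source_vector) {
    std::vector<float> y(a.rows, 0.0f);
    for (std::size_t r = 0; r < a.rows; ++r) {
        float sum = 0.0f;
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            // values[k] and col_idx[k] are read sequentially, but the
            // source_vector read jumps around: this is the gather that
            // cannot be coalesced on the GPU.
            sum += a.values[k] * source_vector[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

A CPU version like this may also serve as a check when the GPU results look wrong.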

Andreas

AndreasBuhr · April 30, 2007, 2:15pm

Solved (after three hours of debugging):

source_vector was not aligned to 256 bytes.
It seems you can only bind textures to memory locations aligned to 256 bytes.
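A small host-side check along these lines might have shortened the debugging. It is a sketch only: the 256-byte figure is the one observed above, not a value queried from the device (the runtime reports the actual requirement in the textureAlignment field of cudaDeviceProp), and the helper names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Alignment observed in this thread; an assumption, not queried from the device.
constexpr std::size_t kTextureAlignment = 256;

// Returns true if ptr is a multiple of alignment bytes.
bool is_aligned(const void* ptr, std::size_t alignment = kTextureAlignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Rounds a byte offset up to the next multiple of alignment, e.g. to place
// a sub-array inside one larger allocation so that it can be bound to a texture.
std::size_t align_up(std::size_t offset, std::size_t alignment = kTextureAlignment) {
    return (offset + alignment - 1) / alignment * alignment;
}
```

Pointers returned directly by cudaMalloc are documented as being aligned to at least 256 bytes, so a misaligned source_vector most likely comes from binding at an offset inside a larger allocation. Padding each sub-array start with align_up would avoid that.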

gserban · May 1, 2007, 1:43am

I am curious: how fast does it run? I implemented SpMV in CSR and got only 1.4 GFlops …

AndreasBuhr · May 2, 2007, 10:37am

Hi,

GFlops are not an appropriate unit in this context. The limiting factor is the memory access speed.

Some numbers (all approx):
Aligned copy from global memory to global memory: 52 GB/s
Unaligned copy from global memory to global memory: 7 GB/s
Copy with unaligned read from texture, aligned write to global memory: 18 GB/s
(which could be 52 GB/s write and 11 GB/s read)

With these numbers, you end up at speeds between less than 1 GFlops if all operations are unaligned and more than 3 GFlops if all operations are aligned.

Unaligned reads are hell expensive, unaligned writes are even worse :(

gserban · May 2, 2007, 3:16pm

Thanks for the numbers, it’s quite useful to have something for reference. But still, for the dot product, which is the most memory-bound of them all (more so than SpMV and SpMM), I got 20 GFlops in CUBLAS, so that’s 80 GB/s of bandwidth. I was thinking that if I can get the memory alignment right, I should get more than 3 GFlops … I hope …
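The conversion from 20 GFlops to 80 GB/s can be written out as follows. This is a sketch that assumes single-precision (4-byte) elements and that every operand is read from memory exactly once. The function name is made up for illustration.

```cpp
// Memory bandwidth in GB/s implied by a dot product running at a given
// GFlops rate. Each pair of elements costs 2 flops (one multiply, one add)
// and 2 loads of bytes_per_element bytes each.
double dot_bandwidth_gbs(double gflops, double bytes_per_element = 4.0) {
    const double flops_per_pair = 2.0;
    const double bytes_per_pair = 2.0 * bytes_per_element;
    return gflops * bytes_per_pair / flops_per_pair;
}
```

For 20 GFlops this works out to 20 * 8 / 2 = 80 GB/s, i.e. 4 bytes of traffic per flop.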

Mordakhay · May 10, 2007, 8:06am

Can anyone help me implement SpMV in CSR where the data is stored in a separate file (data.txt)? This is the first time I have done graphics programming, so I would appreciate it if anyone who helps could be as clear as possible.

Thanks very much to all.