Access to faster memory

Hi mates,

I’m using a kernel in which each thread accesses a single row of a matrix intensively, and that row is different from thread to thread.

Can I cache that row in any way? I don’t want to use shared memory, because each thread’s row isn’t shared with the other threads of the same group. The row is an array of float, not very long, about 17-25 elements.

Thanks for help

You could try binding the data to a texture. Texture loads still come from global memory, but they go through a small on-die read cache. If your access patterns are fairly compact, you will probably get a useful improvement over plain global memory loads.
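As a rough sketch of what that could look like with the texture object API (names like `rowKernel` and `makeRowTexture` are illustrative, not from the post; older CUDA versions would use texture references with `cudaBindTexture` instead):

```cuda
#include <cuda_runtime.h>

// Sketch: each thread walks its own row, reading through the texture cache.
__global__ void rowKernel(cudaTextureObject_t tex, float *out, int rowLen)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < rowLen; ++i)
        sum += tex1Dfetch<float>(tex, row * rowLen + i); // cached read
    out[row] = sum;
}

// Host side: wrap an existing linear device buffer in a texture object.
cudaTextureObject_t makeRowTexture(float *devPtr, size_t bytes)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = devPtr;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```

Since rows of 17-25 floats fit comfortably in a cache line or two, consecutive fetches within one row should hit the texture cache after the first load.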

I think it’s a good idea, but my data is a bit larger than a simple float or int. I’m using size_t (unsigned long on my architecture), whose size is equal to two ints. How can I manage that? All I found is how to bind simple types (float, int, ...), so I think I must fetch something like two ints and then reinterpret them as a long. Am I wrong?

Thanks for replies