How to handle data that every thread reads frequently and that is larger than constant memory

Hi, I’m working on a PDE solver.
In every thread I need to multiply a matrix by a vector.
The vector is the solution data (it can be stored in global memory). The matrix is constant and is the same for every thread.
However, the matrix is about 1024*1024, which requires at least 4 MB of memory, much more than the 64 KB of constant memory.
Since I need to read it at every step of the calculation, putting it in global memory will significantly reduce performance.
Is there any way to solve this problem?
Thanks.

I’m sorry, it seems I posted this question on the wrong board.

Quick answer: use texture memory. (And yes, this is the wrong board for this question.) Alternatively, leave the matrix in global memory and declare it const: from Fermi onwards, the hardware broadcasts a global-memory read to all threads that request the same location. Test both.
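
To make "test both" concrete, here is a rough sketch of the two variants side by side. It is not the original poster's solver: the kernel names, the sizes `n` and `nThreads`, the fill values, and the interleaved layout of the per-thread vectors are all assumptions made for illustration. The texture variant uses the texture object API (`cudaCreateTextureObject`, `tex1Dfetch`), which needs compute capability 3.0 or newer, so it is a later replacement for whatever texture mechanism a Fermi-era code would have used. Error checking is left out to keep it short.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant 1: matrix stays in global memory behind a const __restrict__ pointer.
// Every thread in a warp reads A[i*n + j] at the same time, i.e. the same address.
// Assumed layout: thread t owns x[j*nThreads + t] and y[i*nThreads + t].
__global__ void matvecConst(const float* __restrict__ A,
                            const float* __restrict__ x,
                            float* __restrict__ y, int n, int nThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nThreads) return;
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j * nThreads + t];
        y[i * nThreads + t] = acc;
    }
}

// Variant 2: the same loop, but the matrix is fetched through a texture object.
__global__ void matvecTex(cudaTextureObject_t texA,
                          const float* __restrict__ x,
                          float* __restrict__ y, int n, int nThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nThreads) return;
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += tex1Dfetch<float>(texA, i * n + j) * x[j * nThreads + t];
        y[i * nThreads + t] = acc;
    }
}

int main()
{
    const int n = 128;         // hypothetical matrix dimension (1024 in the question)
    const int nThreads = 256;  // hypothetical number of threads, one vector each
    std::vector<float> hA(n * n), hx(n * nThreads), hy1(n * nThreads), hy2(n * nThreads);
    for (size_t k = 0; k < hA.size(); ++k) hA[k] = (k % 7) * 0.25f;  // arbitrary fill
    for (size_t k = 0; k < hx.size(); ++k) hx[k] = (k % 5) * 0.5f;   // arbitrary fill

    float *dA, *dx, *dy;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy1.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Bind the existing linear device buffer to a texture object (no extra copy).
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dA;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = hA.size() * sizeof(float);
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t texA = 0;
    cudaCreateTextureObject(&texA, &res, &td, nullptr);

    const int block = 128;
    const int grid = (nThreads + block - 1) / block;

    matvecConst<<<grid, block>>>(dA, dx, dy, n, nThreads);
    cudaMemcpy(hy1.data(), dy, hy1.size() * sizeof(float), cudaMemcpyDeviceToHost);

    matvecTex<<<grid, block>>>(texA, dx, dy, n, nThreads);
    cudaMemcpy(hy2.data(), dy, hy2.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Sanity check: both variants should produce the same result.
    float maxDiff = 0.0f;
    for (size_t k = 0; k < hy1.size(); ++k)
        maxDiff = std::fmax(maxDiff, std::fabs(hy1[k] - hy2[k]));
    std::printf("max |const - tex| = %g\n", maxDiff);

    cudaDestroyTextureObject(texA);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```

To actually compare the two, one option is to wrap each launch in a pair of `cudaEventRecord` calls and read the time with `cudaEventElapsedTime`. Which variant wins probably depends on the GPU generation and on the matrix size, so measure on your own hardware at the real size.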