How to store big data (4 MB, larger than constant memory) that is frequently read by every thread

Hi, I’m working on a PDE solver.
In every thread I need to multiply a matrix by a vector.
The vector is the solution data (it can be stored in global memory). The matrix is constant and is the same for every thread.
However, the matrix is about 1024×1024, which requires at least 4 MB of memory, much larger than the 64 KB of constant memory.
Since I need to read it at every step of the calculation, putting it in global memory would hurt performance badly.
Is there any way to solve this problem?


Constant memory is only effective when you have just a few values. You can put the matrix in global memory and bind it to a texture, or use shared memory. Compute capability 2.0 and 3.x devices have L1 and L2 caches that are effective enough that many hand optimizations are unnecessary. Also, if the memory accesses are coalesced, the transfer rate is higher than with constant memory. I suggest a combination of shared memory and coalesced access.
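A minimal sketch of that combination (all names are illustrative, and it assumes `A` is stored column-major so that consecutive threads read consecutive addresses, and that the launch uses a block size equal to `TILE`):

```cuda
#define TILE 256

// y = A * x, where A is n-by-n, stored COLUMN-major in global memory.
// Each thread computes one row of y. The vector x is staged through
// shared memory in TILE-sized slices so its elements are loaded once
// per block (coalesced) and reused by every thread in the block.
// The column-major layout of A makes the A loads coalesced too:
// consecutive threads (consecutive rows) touch consecutive addresses.
__global__ void matvec(const float *A, const float *x, float *y, int n)
{
    __shared__ float xs[TILE];                      // slice of x
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        // Cooperative, coalesced load of one slice of x.
        if (base + threadIdx.x < n)
            xs[threadIdx.x] = x[base + threadIdx.x];
        __syncthreads();

        if (row < n) {
            int end = min(TILE, n - base);
            for (int j = 0; j < end; ++j)
                sum += A[(base + j) * n + row] * xs[j];
        }
        __syncthreads();                            // before next slice load
    }
    if (row < n)
        y[row] = sum;
}
```

Launched, for example, as `matvec<<<(n + TILE - 1) / TILE, TILE>>>(d_A, d_x, d_y, n);`. The point is that each element of `x` is fetched from global memory once per block instead of once per thread.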

Thank you very much.
Since shared memory is also small, I don’t know how to fit the matrix into it.
I’m a beginner in CUDA, and I have no idea about textures, the L1 and L2 caches, or coalesced access.
I’m reading David B. Kirk and Wen-mei W. Hwu’s Programming Massively Parallel Processors, but I haven’t come across these concepts yet. Is there any suggested reading on them?


You do not put everything into shared memory at once; rather, you load the portion each block needs.
For dense matrix-vector multiplication you can use the cuBLAS library. Or, if you want to implement it yourself, there are many examples on the net of how to do it efficiently (just do a Google search for CUDA matrix-vector multiplication).
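For reference, a sketch of the cuBLAS route (assuming `d_A`, `d_x`, `d_y` are already allocated and filled on the device, and that `A` is stored column-major, which is what cuBLAS expects):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = A * x on the GPU via cuBLAS.
// d_A: n-by-n matrix, column-major, in device memory.
// d_x, d_y: n-element device vectors.
void gemv(cublasHandle_t handle,
          const float *d_A, const float *d_x, float *d_y, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS_OP_N = no transpose; lda = n for a column-major n-by-n matrix.
    cublasSgemv(handle, CUBLAS_OP_N, n, n,
                &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
}
```

The handle comes from `cublasCreate(&handle)` once at startup; since your matrix never changes, you upload it a single time and only move the vectors between steps.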
The L1 cache and shared memory are physically the same on-chip memory. The hardware caches global accesses through L1 and L2 so effectively that in some cases there is no benefit in optimizing yourself, but these caches are only present on Fermi or newer cards.
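Because that on-chip memory is split between L1 and shared memory, a kernel that uses little shared memory can ask the runtime for the larger L1 partition (a sketch; `matvec` is a placeholder kernel name):

```cuda
// On Fermi/Kepler the per-SM on-chip SRAM is partitioned between
// L1 cache and shared memory; request the larger L1 split for a
// kernel that relies on caching rather than shared memory.
cudaFuncSetCacheConfig(matvec, cudaFuncCachePreferL1);
```

Call this once before the first launch of the kernel; use `cudaFuncCachePreferShared` instead for kernels that stage data in shared memory.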
Constant memory is only useful when you have just a few constants. Textures are special units bound to an array in global memory; their reads are cached, which helps when the access pattern is scattered rather than coalesced.
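On compute capability 3.5 and newer cards you can get texture-cache behavior without the older bind-to-texture API, by routing loads through the read-only data cache (a sketch with illustrative names; assumes `A` is column-major as above):

```cuda
// Per-row matrix-vector product that pulls x through the read-only
// (texture) data cache. const __restrict__ tells the compiler the
// data is never written by this kernel, and __ldg forces a
// read-only-cache load explicitly (compute capability 3.5+).
__global__ void matvec_ro(const float * __restrict__ A,
                          const float * __restrict__ x,
                          float * __restrict__ y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        sum += A[j * n + row] * __ldg(&x[j]);   // cached, reused across threads
    y[row] = sum;
}
```

Every thread reads the whole of `x`, so caching those reads is exactly where the texture path pays off.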

I suggest reading ‘CUDA by Example’; it is rather old, but it gives a good idea of the basics. The CUDA Programming Guide also has clear descriptions and example code.

I really appreciate your patience.
Your posts have helped a lot.