How to store large data (4 MB, larger than constant memory) which is frequently used by every thread lik

HongXiang · October 25, 2013, 4:28pm

Hi, I’m dealing with a PDE solver problem.
In every thread I need to multiply a matrix by a vector.
The vector is the solution data (it can be stored in global memory). The matrix is constant; it is the same for every thread.
However, the matrix is about 1024*1024 elements, which requires at least 4 MB of memory, much more than the 64 KB of constant memory.
Since I need to read it in every step of the calculation, putting it in global memory would significantly reduce performance.
Is there any way to solve this problem?
Thanks.

pasoleatis · October 25, 2013, 4:42pm

Hello,

Constant memory is only effective when there are only a few values. You can put the matrix in global memory and bind it to textures, or use shared memory. Devices of compute capability 2.0 and 3.x have L1 and L2 caches, which are effective enough that you often do not need many optimizations. Also, if the memory access is coalesced, the transfer rate is higher than with constant memory. I suggest a combination of shared memory and coalesced access.

HongXiang · October 26, 2013, 3:10am

Thank you very much.
Since shared memory is also small, I don’t know how to fit the matrix into it.
I’m also a beginner in CUDA, and I have no idea about textures, the L1 and L2 caches, or coalesced access.
I’m reading Programming Massively Parallel Processors by David B. Kirk and Wen-mei W. Hwu, and I have not come across these concepts yet. Is there any suggested reading about them?

pasoleatis · October 26, 2013, 1:31pm

Hello,

You do not put everything in shared memory at once; rather, each block loads only the portions it needs.
For dense matrix-vector multiplication you can use the cuBLAS library. If you want to implement it yourself, there are many examples on the net showing how to do it efficiently (just search for "CUDA matrix-vector multiplication").
The L1 cache and shared memory are physically the same on-chip memory. The hardware manages the L1 and L2 caches efficiently enough that in some cases there is no benefit from optimizing by hand, but these caches are present only on Fermi or newer cards.
Constant memory is only useful when you have only a few constants. Textures are special hardware units that are bound to an array located in global memory; they are optimized for irregular accesses that have some spatial locality.
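To make the "load only the portions you need" idea concrete, here is a minimal sketch for y = A * x, not a tuned implementation. It assumes the matrix is stored column-major so that neighbouring threads read neighbouring addresses (coalesced), and it stages the vector through shared memory one tile at a time. The names `matvec_tiled`, `matvec_tiled_host` and the tile size `TILE` are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // threads per block and tile width (illustrative choice)

// y = A * x for an n x n matrix, one thread per row.
// A_cm is column-major: A_cm[j * n + i] holds A(i, j), so threads with
// consecutive row indices read consecutive addresses (coalesced loads).
// Must be launched with blockDim.x == TILE.
__global__ void matvec_tiled(const float* A_cm, const float* x, float* y, int n)
{
    __shared__ float x_tile[TILE];  // one tile of x, shared by the whole block
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        // Each thread loads one element of the current tile of x.
        int col = base + threadIdx.x;
        x_tile[threadIdx.x] = (col < n) ? x[col] : 0.0f;
        __syncthreads();

        int limit = min(TILE, n - base);
        if (row < n) {
            for (int k = 0; k < limit; ++k)
                sum += A_cm[(size_t)(base + k) * n + row] * x_tile[k];
        }
        __syncthreads();  // do not overwrite the tile while others still read it
    }
    if (row < n) y[row] = sum;
}

// Hypothetical host wrapper: copies data to the device, runs the kernel,
// and returns the result. Error checking omitted for brevity.
std::vector<float> matvec_tiled_host(const std::vector<float>& A_cm,
                                     const std::vector<float>& x, int n)
{
    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * n * n);
    cudaMalloc(&dx, sizeof(float) * n);
    cudaMalloc(&dy, sizeof(float) * n);
    cudaMemcpy(dA, A_cm.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    matvec_tiled<<<blocks, TILE>>>(dA, dx, dy, n);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), dy, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```

Only `TILE` floats of shared memory are used per block, regardless of how large the matrix is; the matrix itself stays in global memory and is streamed through once.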
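As a rough sketch of the cuBLAS route, the call below uses `cublasSgemv` for single-precision y = A * x. It assumes the matrix is stored row-major on the host; since cuBLAS expects column-major storage, the buffer looks like the transpose to cuBLAS, and `CUBLAS_OP_T` undoes that. The wrapper name `matvec_cublas` is made up for illustration, error checking is omitted, and the program needs to be linked with `-lcublas`.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical wrapper: y = A * x for an n x n row-major matrix A.
// Return codes from CUDA and cuBLAS are ignored here for brevity.
std::vector<float> matvec_cublas(const std::vector<float>& A_rowmajor,
                                 const std::vector<float>& x, int n)
{
    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * n * n);
    cudaMalloc(&dx, sizeof(float) * n);
    cudaMalloc(&dy, sizeof(float) * n);
    cudaMemcpy(dA, A_rowmajor.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta = 0.0f;  // with beta == 0 the old contents of y are ignored
    // cuBLAS reads the row-major buffer as the column-major matrix A^T,
    // so requesting the transpose yields (A^T)^T * x = A * x.
    cublasSgemv(handle, CUBLAS_OP_T, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), dy, sizeof(float) * n, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```

In a solver that runs many steps, the handle and the device copy of the matrix would normally be created once and reused, rather than rebuilt on every call as in this sketch.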
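On newer devices (compute capability 3.5 and up), read-only data can be routed through the read-only/texture cache without binding a texture, by loading it with `__ldg`. The sketch below uses that instead of the texture binding API, so it is a swapped-in technique and not the texture approach itself. It assigns one warp to each row of a row-major matrix, so that the 32 lanes read 32 consecutive elements (coalesced), and combines the partial sums with warp shuffles. The names `matvec_warp_per_row` and `matvec_ldg_host` are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y = A * x for an n x n row-major matrix, one warp per row.
// Requires compute capability 3.5+ for __ldg and blockDim.x a multiple of 32.
__global__ void matvec_warp_per_row(const float* __restrict__ A,
                                    const float* __restrict__ x,
                                    float* __restrict__ y, int n)
{
    int lane = threadIdx.x % 32;
    int row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row >= n) return;  // row is the same for all lanes, so the whole warp exits

    // Lanes read consecutive columns of the row: coalesced, via the read-only cache.
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += __ldg(&A[(size_t)row * n + col]) * __ldg(&x[col]);

    // Tree reduction of the 32 partial sums inside the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> matvec_ldg_host(const std::vector<float>& A,
                                   const std::vector<float>& x, int n)
{
    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * n * n);
    cudaMalloc(&dx, sizeof(float) * n);
    cudaMalloc(&dy, sizeof(float) * n);
    cudaMemcpy(dA, A.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    const int threads = 128;               // 4 warps, i.e. 4 rows per block
    const int rows_per_block = threads / 32;
    int blocks = (n + rows_per_block - 1) / rows_per_block;
    matvec_warp_per_row<<<blocks, threads>>>(dA, dx, dy, n);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), dy, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```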

I suggest reading ‘CUDA by Example’; it is rather old, but it gives a good overview. The CUDA Programming Guide has clear descriptions and example code.

HongXiang · October 26, 2013, 3:44pm

I really appreciate your patience.
Your post helps a lot.
