Multiple memory access

francy300485 · February 20, 2012, 11:42am

Hi, I have this problem.
I have a vector of nine elements containing the coefficients of a polynomial, and I want to evaluate this polynomial at every point of a volume, in parallel. To do this I would assign one thread to each volume element. However, each thread then has to read all nine coefficients. I think these accesses would be serialized; what can I do to avoid this? That is, how can the threads share the coefficient values?
Would it be a good idea to copy the coefficients into a shared memory array?
Thanks in advance.

pQB · February 20, 2012, 12:03pm

If all the threads read each coefficient at the same time, the best place for the data is constant memory, because a single read from constant memory is broadcast to all the threads. Note that all the threads in a warp must read the same element; reads of different addresses are serialized, so the constant memory access would not perform as well as expected. Furthermore, constant memory is cached. Take a look at

Regards.
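
A minimal sketch of the constant-memory approach described above. It assumes the polynomial is univariate of degree 8, evaluated with Horner's rule on one `float` value per volume element; the thread does not say what form the polynomial takes. The names `c_coeffs`, `evalPolyConstant` and `runEvalPolyConstant` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_COEFFS 9

// Coefficients in constant memory. When every thread of a warp reads the
// same element, the value is broadcast instead of being fetched per thread.
__constant__ float c_coeffs[NUM_COEFFS];

// One thread per volume element (assumed layout: one float per element).
// At each Horner step all threads read c_coeffs[k] for the same k.
__global__ void evalPolyConstant(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    float acc = c_coeffs[NUM_COEFFS - 1];
    for (int k = NUM_COEFFS - 2; k >= 0; --k)
        acc = acc * xi + c_coeffs[k];
    y[i] = acc;
}

// Hypothetical host helper: uploads the data, launches the kernel and
// downloads the results. Returns the first CUDA error encountered.
cudaError_t runEvalPolyConstant(const float *h_coeffs, const float *h_x,
                                float *h_y, int n)
{
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;

    cudaError_t err = cudaMemcpyToSymbol(c_coeffs, h_coeffs,
                                         NUM_COEFFS * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_y, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        evalPolyConstant<<<blocks, threads>>>(d_x, d_y, n);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);  // freeing a null pointer is a no-op
    cudaFree(d_y);
    return err;
}
```

The coefficients are uploaded once with `cudaMemcpyToSymbol` before the launch; the kernel itself never takes them as an argument.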

RezaRob3 · February 20, 2012, 1:58pm

Shared memory also broadcasts to multiple threads reading the same address, and with compute capability 2.x “multiple words can be broadcast in a single transaction.”

On Fermi everything is cached, including global memory accesses (L1 and L2).
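
For comparison, a sketch of the shared-memory variant, under the same assumptions as before (univariate degree-8 polynomial, one `float` per volume element). The first nine threads of each block stage the coefficients from global memory, so the block size must be at least nine. The names `evalPolyShared` and `runEvalPolyShared` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_COEFFS 9

// Each block copies the coefficients from global to shared memory once.
// Afterwards all threads read s_coeffs[k] for the same k, which shared
// memory serves as a broadcast.
__global__ void evalPolyShared(const float *coeffs, const float *x,
                               float *y, int n)
{
    __shared__ float s_coeffs[NUM_COEFFS];

    // Assumes blockDim.x >= NUM_COEFFS.
    if (threadIdx.x < NUM_COEFFS)
        s_coeffs[threadIdx.x] = coeffs[threadIdx.x];
    // Every thread of the block must reach the barrier, so the bounds
    // check on i comes after it.
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];
        float acc = s_coeffs[NUM_COEFFS - 1];
        for (int k = NUM_COEFFS - 2; k >= 0; --k)
            acc = acc * xi + s_coeffs[k];
        y[i] = acc;
    }
}

// Hypothetical host helper. Returns the first CUDA error encountered.
cudaError_t runEvalPolyShared(const float *h_coeffs, const float *h_x,
                              float *h_y, int n)
{
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(float);
    size_t cbytes = NUM_COEFFS * sizeof(float);
    float *d_c = nullptr, *d_x = nullptr, *d_y = nullptr;

    cudaError_t err = cudaMalloc((void **)&d_c, cbytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_y, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_c, h_coeffs, cbytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;  // must be >= NUM_COEFFS
        int blocks = (n + threads - 1) / threads;
        evalPolyShared<<<blocks, threads>>>(d_c, d_x, d_y, n);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_c);  // freeing a null pointer is a no-op
    cudaFree(d_x);
    cudaFree(d_y);
    return err;
}
```

Note the extra cost compared with constant memory: a per-block load from global memory and a `__syncthreads()` barrier before any thread can start evaluating.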

pQB · February 20, 2012, 3:38pm

That’s true, but for so few coefficients, loading the data from global to shared memory would keep too many threads idle, whereas reading the data from constant memory is, from my point of view, more efficient.

That said, as RezaRob3 commented, on the Fermi architecture everything is cached, so if you do not reuse the coefficients, reading them directly from global memory could be good enough, as the data will be cached.

Regards.
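
A sketch of the simplest option mentioned above: passing the coefficients as an ordinary global-memory array and relying on the cache. The assumptions are the same as for the other variants (univariate degree-8 polynomial, one `float` per volume element), and the names `evalPolyGlobal` and `runEvalPolyGlobal` are hypothetical. Because the kernel only reads the coefficients, the threads need no synchronization to share them.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_COEFFS 9

// Coefficients are read straight from global memory. Every thread reads
// the same nine addresses, so after the first access they are expected
// to be served from cache on architectures that cache global loads.
__global__ void evalPolyGlobal(const float *coeffs, const float *x,
                               float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    float acc = coeffs[NUM_COEFFS - 1];
    for (int k = NUM_COEFFS - 2; k >= 0; --k)
        acc = acc * xi + coeffs[k];
    y[i] = acc;
}

// Hypothetical host helper. Returns the first CUDA error encountered.
cudaError_t runEvalPolyGlobal(const float *h_coeffs, const float *h_x,
                              float *h_y, int n)
{
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(float);
    size_t cbytes = NUM_COEFFS * sizeof(float);
    float *d_c = nullptr, *d_x = nullptr, *d_y = nullptr;

    cudaError_t err = cudaMalloc((void **)&d_c, cbytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_y, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_c, h_coeffs, cbytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        evalPolyGlobal<<<blocks, threads>>>(d_c, d_x, d_y, n);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_c);  // freeing a null pointer is a no-op
    cudaFree(d_x);
    cudaFree(d_y);
    return err;
}
```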

francy300485 · February 20, 2012, 3:58pm

Hi, thank you so much.

Yes, I know that everything is cached on the Fermi architecture, and actually my first solution used global memory directly. I was wondering whether using global memory could cause synchronization problems that could be avoided by using registers or something similar. Currently I’m trying to use constant memory for my coefficients, since I agree with you, pQB: for so few elements, using shared memory would be less efficient.

Thanks.
