when should I use constant cache for speeding up

[codebox]int warp_id = tid / 32;

for (int i = warp_id; i < num_combine_rows; i += num_warps)
{
    int row_start = row[i];
    int row_end   = row[i + 1];

    for (int j = row_start; j < row_end; j += 32)
    {
        // process elements of row i here
    }
}[/codebox]

All the threads in one warp access the same address in the array `row`, which I assumed would result in an uncoalesced access.

But I get no speedup if I store the values of `row` in constant memory.

Doesn’t this situation satisfy the conditions for using the constant cache?

Or do I misunderstand coalesced access, and this situation won’t result in an uncoalesced access after all?
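For reference, a minimal sketch of what I mean by storing the row pointers in constant memory (`MAX_ROWS`, `c_row`, and the kernel name are made up here; this assumes the row-pointer array fits in the 64 KB constant space):

```cuda
// Sketch only: CSR row pointers placed in constant memory.
// MAX_ROWS is an assumed compile-time bound on the number of rows.
#define MAX_ROWS 4096

__constant__ int c_row[MAX_ROWS + 1];

// Host side, before the launch:
// cudaMemcpyToSymbol(c_row, h_row, (num_rows + 1) * sizeof(int));

__global__ void combine_kernel(int num_combine_rows, int num_warps /*, ... */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = tid / 32;

    for (int i = warp_id; i < num_combine_rows; i += num_warps)
    {
        // All 32 threads of the warp read the same address — a broadcast,
        // which the constant cache serves quickly once the line is cached.
        int row_start = c_row[i];
        int row_end   = c_row[i + 1];

        for (int j = row_start; j < row_end; j += 32)
        {
            // process elements of row i
        }
    }
}
```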

This is from the programming guide, section G.3.2.2 (CUDA version 3.1):

“Threads can access any words in any order, including the same words, and a single memory transaction for each segment addressed by the half-warp is issued”

So using cache won’t boost things :)

eyal

firstly instead of “warp_id = tid / 32;”, it would be faster to do “warp_id = tid >> 5;”. but “warp_id=blockIdx.x” would be even better. speaking of which, if you have 32 threads per block, warp_id will always = 0 in the current code.

but anyways…

constant memory is no faster than shared memory. they are the same speed. the only difference is their scope and access. constant is shared amongst the entire gpu and is read-only. shared is shared only w/in a block and is read/write. so make decisions on which one to use where based on that. i generally use constant memory instead of shared whenever i can because i find shared memory to always be in short supply and i can’t find much to put in constant memory anyways.

Thank you!

But my version is 2.3

Why not? Even though there will be just one transaction for the warp, constant cache accesses have much lower latency than global reads.

I don’t know how the old cuda version plays a part here.

Constant cache still is backed up by a region of “global mem”… No?
So, the first access will result in gmem read…
Only repeated, broadcast-like accesses will give you the benefit.
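In other words, the pattern that pays off is one where the same constant-memory words are read over and over by whole warps. A hypothetical sketch (the names and the coefficient table are mine):

```cuda
__constant__ float c_coeffs[64];  // small read-only table, reused every thread

__global__ void apply(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float x = data[tid];
    float acc = 0.0f;
    // Every thread in the warp reads the same c_coeffs[k] each iteration:
    // the first touch of a line misses to global memory, but all the
    // repeated broadcast reads after that are served from the constant cache.
    for (int k = 0; k < 64; ++k)
        acc += c_coeffs[k] * x;
    data[tid] = acc;
}
```

A single read per kernel, as in the row-pointer case above, never gets past that first miss, which matches the observation that there was no speedup.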

Yeah you’re right, just a single access is not going to show any improvement.
