when should I use constant cache for speeding up

[codebox]int warp_id = tid / 32;

for (int i = warp_id; i < num_combine_rows; i += num_warps)
{
    int row_start = row[i];
    int row_end   = row[i + 1];

    for (int j = row_start; j < row_end; j += 32)
    {
        // process elements of row i here
    }
}[/codebox]

All the threads in one warp access the same address in the array `row`, which I assumed would result in an uncoalesced access.

But I get no speedup if I store the values of `row` in constant memory.

Doesn’t this situation satisfy the conditions for using the constant cache?

Or do I misunderstand coalesced access, and this situation won’t result in an uncoalesced access after all?
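For reference, a minimal sketch of what I mean by storing the row pointers in constant memory (`MAX_ROWS`, `c_row`, and the kernel name are made up here; this assumes the row-pointer array fits in the 64 KB constant space):

```cuda
// Sketch only: CSR row pointers placed in constant memory.
// MAX_ROWS is an assumed compile-time bound on the number of rows.
#define MAX_ROWS 4096

__constant__ int c_row[MAX_ROWS + 1];

// Host side, before the launch:
// cudaMemcpyToSymbol(c_row, h_row, (num_rows + 1) * sizeof(int));

__global__ void combine_kernel(int num_combine_rows, int num_warps /*, ... */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = tid / 32;

    for (int i = warp_id; i < num_combine_rows; i += num_warps)
    {
        // All 32 threads of the warp read the same address — a broadcast,
        // which the constant cache serves quickly once the line is cached.
        int row_start = c_row[i];
        int row_end   = c_row[i + 1];

        for (int j = row_start; j < row_end; j += 32)
        {
            // process elements of row i
        }
    }
}
```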

This is from the programming guide, section G.3.2.2 (CUDA version 3.1):

“Threads can access any words in any order, including the same words, and a single memory transaction for each segment addressed by the half-warp is issued”

So using cache won’t boost things :)

eyal

firstly instead of “warp_id = tid / 32;”, it would be faster to do “warp_id = tid >> 5;”. but “warp_id=blockIdx.x” would be even better. speaking of which, if you have 32 threads per block, warp_id will always = 0 in the current code.

but anyways…

constant memory is no faster than shared memory. they are the same speed. the only difference is their scope and access. constant is shared amongst the entire gpu and is read-only. shared is shared only w/in a block and is read/write. so make decisions on which one to use where based on that. i generally use constant memory instead of shared whenever i can because i find shared memory to always be in short supply and i can’t find much to put in constant memory anyways.

Thank you!

But my version is 2.3

Why not? Even though there will be just one transaction for the warp, constant cache accesses have much lower latency than global reads.

I don’t know how the old cuda version plays a part here.

Constant cache still is backed up by a region of “global mem”… No?
So, the first access will result in gmem read…
Only repeated, broadcast-like accesses will give you the benefit.
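In other words, the pattern that pays off is one where the same constant-memory words are read over and over by whole warps. A hypothetical sketch (the names and the coefficient table are mine):

```cuda
__constant__ float c_coeffs[64];  // small read-only table, reused every thread

__global__ void apply(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float x = data[tid];
    float acc = 0.0f;
    // Every thread in the warp reads the same c_coeffs[k] each iteration:
    // the first touch of a line misses to global memory, but all the
    // repeated broadcast reads after that are served from the constant cache.
    for (int k = 0; k < 64; ++k)
        acc += c_coeffs[k] * x;
    data[tid] = acc;
}
```

A single read per kernel, as in the row-pointer case above, never gets past that first miss, which matches the observation that there was no speedup.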

Yeah you’re right, just a single access is not going to show any improvement.
