Really slow constant memory: random access to constant memory

Philipp82 · December 2, 2009, 8:19am

Hi all,

In my program, every thread must read two different float values from random positions in an array[2000].
I tried different memory types for the same problem: I stored the array in global memory, which was quite fast, then in shared memory, which was a bit faster, and finally in constant memory, which was really slow. Of course, I get the same result from all three kernels.

With constant memory, the kernel takes about ten times as long as with global memory. Is this what you would expect for random access to constant memory, or is my code broken? I thought that in principle constant memory should be faster, because I only read from it…

Thanks for any help,
Philipp.

Cygnus_X1 · December 2, 2009, 8:43am

According to the Programming Guide (at least as I understood it), constant memory is fast if all threads read the same value. If the threads read different values, however, the reads are serialized.

Here is the line:

Philipp82 · December 2, 2009, 9:42am

I read this, but how can that be slower than reading from global memory?

Cygnus_X1 · December 2, 2009, 11:20am

If you read:
int reg = globalArray[threadIdx.x];
this is resolved in one memory transaction per half-warp (the access is coalesced).

However, if you write:
int reg = constantArray[threadIdx.x];
this will resolve into 16 transactions per half-warp.
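
A minimal sketch of the two constant-memory access patterns discussed above. The names `constTable`, `readSame`, `readScattered` and `distinctAddressesPerHalfWarp` are hypothetical and not taken from the original poster's code. The host helper is only a rough model, assuming one serialized request per distinct address within a half-warp, as described for the hardware of that time.

```cuda
#include <cuda_runtime.h>
#include <set>

#define TABLE_SIZE 2000                  // hypothetical, matches array[2000] in the question
__constant__ float constTable[TABLE_SIZE];

// Broadcast pattern: every thread reads the same element k,
// so a half-warp can be served by a single constant-cache request.
__global__ void readSame(float *out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = constTable[k];
}

// Scattered pattern: each thread reads its own element idx[tid]
// (values must lie in [0, TABLE_SIZE)), so the reads get serialized.
__global__ void readScattered(float *out, const int *idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = constTable[idx[tid]];
}

// Host-side model (an assumption, for illustration only): count the
// distinct indices requested by one half-warp of 16 threads.
int distinctAddressesPerHalfWarp(const int *idx, int halfWarpSize = 16)
{
    std::set<int> seen(idx, idx + halfWarpSize);
    return static_cast<int>(seen.size());
}
```

The table itself would be uploaded from the host with `cudaMemcpyToSymbol(constTable, hostTable, sizeof(constTable))` before either kernel is launched.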

Philipp82 · December 2, 2009, 12:15pm

Wow. I guess shared memory will be the best… Thx

cudesnick · December 2, 2009, 9:47pm

Philipp,

I used textures for exactly this purpose: storing a look-up table. Due to tight memory constraints, I couldn’t afford to store these look-up data in the shared memory, which would probably have been faster. Textures are cached, hence the performance gain vs global memory. From what I understand, texture cache is not optimized the same way as constant memory cache is: it’s not required that all threads of a warp fetch the same value on a given cycle. As far as I remember, texture cache is 8K, so make sure your data fits into it.

In case you don’t have experience with textures, here’s how it works: you copy your look-up table data from host to device global memory in the normal way, but access the data in the kernel code by calling texture fetch functions (as opposed to dereferencing a global memory pointer). In addition, you have to “bind” the texture to the global memory after this memory is cudaMalloc’ed and before the kernel invocation. The Programming Guide has a fairly clear explanation of this technique.
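
The steps above might look roughly like the following sketch. Note the assumption: it uses the texture object API (`cudaCreateTextureObject`, `tex1Dfetch<float>`) from newer CUDA toolkits, not the texture reference and `cudaBindTexture` calls that the "bind" step referred to at the time of this thread and that were later deprecated. The names `lookupTex` and `makeTableTexture` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Kernel: each thread fetches one table entry through the texture path
// instead of dereferencing a global memory pointer.
__global__ void lookupTex(cudaTextureObject_t tex, const int *idx,
                          float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = tex1Dfetch<float>(tex, idx[tid]);
}

// Host: wrap an already cudaMalloc'ed and filled table of `count` floats
// in a texture object (the modern counterpart of "binding" a texture).
cudaTextureObject_t makeTableTexture(float *d_table, size_t count)
{
    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;            // plain linear device memory
    res.res.linear.devPtr = d_table;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;           // return the stored float unchanged

    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &td, NULL);
    return obj;
}
```

A caller would create the texture object once after the host-to-device copy, pass it to `lookupTex` as a kernel argument, and release it with `cudaDestroyTextureObject` when done.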

Good luck, and if you try it, please report the results.

Philipp82 · December 3, 2009, 8:30am

Hi,

Well, I am going to try it, but it looks pretty complicated for a “non-information-scientist / rookie programmer”…

Philipp82 · December 3, 2009, 2:58pm

Ok, I got it running.
So the execution time of the different approaches depends strongly on the number of values involved. Actually it is not a lookup table in my case, but I will call it that from now on, because it’s the same principle.

1) For 3200 values in the lookup table, reading from global memory is the slowest, reading from texture memory is twice as fast, and reading from shared memory is in between. Constant memory is much slower, as I already wrote.
2) For fewer values in the lookup table, e.g. 800, reading from global memory remains the slowest, but reading from shared memory, texture memory and even constant memory are nearly the same in speed, about 2-3 times faster than the global memory case. Texture memory reads are still the fastest.
3) For even fewer values (200) it’s nearly the same, but constant memory becomes the fastest, being nearly 8 times faster than the global memory read; texture memory is only 6 times faster and shared memory is about 4 times faster.

This is, I guess, what should be expected: global memory is slow anyway and becomes even slower when bank conflicts become more likely. For smaller arrays constant memory becomes faster, because parallel reads from one block become more likely…

Philipp.

Cygnus_X1 · December 3, 2009, 3:07pm

But shared memory should be about a hundred times faster than global, not only twice as fast…

Philipp82 · December 3, 2009, 3:14pm

That depends. The times I wrote are execution times of the whole kernel. The data has to be copied to the shared mem from global memory at the beginning of the kernel.

eyalhir74 · December 3, 2009, 4:12pm

You might want to post the kernel code and the kernel invocation/benchmarking code, to verify there is nothing wrong there.

eyal

Philipp82 · December 3, 2009, 4:47pm

Look, when I have 3200 points in my lookup table, then with a block size of, let’s say, 400, every thread has to copy 8 values from global memory to shared memory. Only then does the random access to shared memory start.

When I read directly from global memory, every thread has to access just 2 values from global memory.

This is the reason why it’s not that much faster using shared memory.
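
The staging cost described above can be sketched as follows. This is not the actual kernel from this thread, which was not posted: the names `lookupShared`, `idxA`, `idxB` and `stagingLoadsPerThread` are hypothetical, and summing the two looked-up values merely stands in for whatever the real computation does.

```cuda
#include <cuda_runtime.h>

// Each block first stages the whole table in shared memory, then every
// thread does its two random lookups from the shared copy.
__global__ void lookupShared(const float *table, int tableSize,
                             const int *idxA, const int *idxB,
                             float *out, int n)
{
    extern __shared__ float sTable[];   // tableSize floats, sized at launch

    // Cooperative copy: with tableSize = 3200 and blockDim.x = 400,
    // every thread performs 3200 / 400 = 8 global loads here.
    for (int i = threadIdx.x; i < tableSize; i += blockDim.x)
        sTable[i] = table[i];
    __syncthreads();                    // the copy must finish before any lookup

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = sTable[idxA[tid]] + sTable[idxB[tid]];  // placeholder computation
}

// Host-side arithmetic only: global loads per thread for the staging copy.
int stagingLoadsPerThread(int tableSize, int blockSize)
{
    return (tableSize + blockSize - 1) / blockSize;
}
```

A launch for the numbers above would pass the shared memory size as the third launch parameter, e.g. `lookupShared<<<blocks, 400, 3200 * sizeof(float)>>>(...)`. With 8 staging loads per thread against 2 direct loads in the global memory version, the shared memory kernel only pays off once the lookups themselves are expensive enough.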

avidday · December 3, 2009, 6:49pm

Constant memory has cache, so the improved performance at small sizes is probably down to fewer cache misses.

cudesnick · December 4, 2009, 6:58pm

Philipp’s analysis raises the following question:

Say, there are several blocks executing on each multi-processor. Say, all these blocks retrieve data through texture fetches from the same texture. Say, the whole textured data-set fits into the texture cache, but only once, i.e. this data-set is less than, but comparable to 8K in size.

Question: in the above scenario, is the texture cache smart enough to hold only one instance of each data point, even though the same data point is accessed chaotically from different blocks that run on the same multiprocessor?
