Really slow constant memory: random access to constant memory

Hi all,

In my program, every thread has to read two different float values from random positions in an array of 2000 elements.
I tried different memory types for the same problem: I stored the array in global memory, which was quite fast, then in shared memory, which was a bit faster, and finally in constant memory, which was really slow. Of course, all three kernels give the same result.

With constant memory, the kernel launch takes about ten times as long as with global memory. Is this what you would expect for random access to constant memory, or is my code broken? I thought that in principle constant memory should be faster, because I only read from it…

Thanks for any help,
Philipp.

According to the Programming Guide (at least as I understood it), constant memory is fast if all threads read the same value from it. If the threads read different values, however, every read is serialized.

Here is the line:

I read this, but how can that be slower than reading from global memory?

If you read:
int reg = globalArray[threadIdx.x];
the accesses are coalesced and this is resolved in one memory transaction.

However, if you write:
int reg = constantArray[threadIdx.x];
every thread reads a different address, so this resolves into 16 serialized transactions per half-warp.
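To make the difference concrete, here is a minimal sketch of the access patterns discussed above. The names (constTable, the three kernels, runAccessPatternDemo) are made up for illustration and are not from Philipp's code; the table size of 2000 is taken from the first post. It only demonstrates the patterns and checks the results; it does not time anything.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kConstTableSize = 2000;  // array[2000] from the first post

// Hypothetical lookup table in constant memory (8000 bytes).
__constant__ float constTable[kConstTableSize];

// Each thread of a warp reads a different constant address:
// the constant cache has to serve these reads one after another.
__global__ void readConstScattered(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = constTable[i];
}

// All threads of a block read the same constant address:
// one read can be broadcast to the whole warp.
__global__ void readConstUniform(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = constTable[blockIdx.x];
}

// Consecutive threads read consecutive global addresses (coalesced).
__global__ void readGlobalCoalesced(const float *table, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[i];
}

// Host driver: runs all three kernels and checks their output.
bool runAccessPatternDemo()
{
    const int n = kConstTableSize;
    const int block = 256;
    const int grid = (n + block - 1) / block;
    const size_t bytes = n * sizeof(float);

    std::vector<float> hTable(n), hOut(n);
    for (int i = 0; i < n; ++i) hTable[i] = 0.5f * i;

    float *dTable = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc(&dTable, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dTable, hTable.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpyToSymbol(constTable, hTable.data(), bytes) == cudaSuccess;

    if (ok) {
        readConstScattered<<<grid, block>>>(dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == hTable[i]);
    }
    if (ok) {
        readConstUniform<<<grid, block>>>(dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == hTable[i / block]);
    }
    if (ok) {
        readGlobalCoalesced<<<grid, block>>>(dTable, dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == hTable[i]);
    }

    cudaFree(dTable);
    cudaFree(dOut);
    return ok;
}
```

Call runAccessPatternDemo() from your own main(). To compare speeds you would wrap each launch in your usual timing code; how large the gap is will depend on the hardware.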

Wow. I guess shared memory will be the best… Thx

Philipp,

I used textures for exactly this purpose: storing a look-up table. Due to tight memory constraints, I couldn’t afford to store these look-up data in the shared memory, which would probably have been faster. Textures are cached, hence the performance gain vs global memory. From what I understand, texture cache is not optimized the same way as constant memory cache is: it’s not required that all threads of a warp fetch the same value on a given cycle. As far as I remember, texture cache is 8K, so make sure your data fits into it.

In case you don’t have experience with textures, here’s how it works: you copy your look-up table data from host to device global memory in the normal way, but access the data in the kernel code by invoking texture fetch functions (as opposed to dereferencing a global memory pointer). In addition, you have to “bind” the texture to the global memory after this memory is cudaMalloc’ed and before the kernel invocation. The Programming Guide has a fairly clear explanation of this technique.
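For readers who want to see the steps in code, here is a rough sketch. Note the assumption: it uses the texture object API (cudaCreateTextureObject) found in current CUDA toolkits, not the older bind-style texture reference calls described above, so it is not the exact code I used. The kernel name, the index array and the table contents are hypothetical.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one table entry through the texture path.
// idx holds the (random-looking) table positions to read.
__global__ void lookupViaTexture(cudaTextureObject_t tex, const int *idx,
                                 float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, idx[i]);
}

// Host driver: builds the table, creates the texture, runs and checks.
bool runTextureLookupDemo()
{
    const int tableSize = 2000;  // hypothetical look-up table size
    const int n = 4096;          // number of lookups
    std::vector<float> hTable(tableSize), hOut(n);
    std::vector<int> hIdx(n);
    for (int i = 0; i < tableSize; ++i) hTable[i] = 2.0f * i;
    for (int i = 0; i < n; ++i) hIdx[i] = (i * 7919) % tableSize;  // scattered

    float *dTable = nullptr, *dOut = nullptr;
    int *dIdx = nullptr;
    cudaTextureObject_t tex = 0;

    // Step 1: copy the table to global memory in the normal way.
    bool ok = cudaMalloc(&dTable, tableSize * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dIdx, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dTable, hTable.data(), tableSize * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;

    // Step 2: describe the linear buffer and create a texture over it.
    if (ok) {
        cudaResourceDesc resDesc;
        std::memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr = dTable;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = tableSize * sizeof(float);

        cudaTextureDesc texDesc;
        std::memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        ok = cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr)
             == cudaSuccess;
    }

    // Step 3: launch the kernel and verify against the host table.
    if (ok) {
        const int block = 256;
        lookupViaTexture<<<(n + block - 1) / block, block>>>(tex, dIdx, dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == hTable[hIdx[i]]);
    }

    if (tex) cudaDestroyTextureObject(tex);
    cudaFree(dTable);
    cudaFree(dIdx);
    cudaFree(dOut);
    return ok;
}
```

Call runTextureLookupDemo() from your own main(). The table stays an ordinary global memory buffer; only the read path in the kernel changes.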

Good luck, and if you try it, please report the results.

Hi,

Well, I am going to try it, but it looks pretty complicated for a “non-information-scientist / rookie programmer”…

Ok, I got it running.
So the execution time for the different approaches depends strongly on the number of values involved. Actually, it is not a lookup table in my case, but I will call it that from now on, because the principle is the same.

  1. For 3200 values in the lookup table, reading from global memory is the slowest, reading from texture memory is two times faster, and reading from shared memory is in between. Constant memory is much slower, as I already wrote.
  2. For fewer values in the lookup table, e.g. 800, reading from global memory remains the slowest, but reading from shared memory, texture memory and even constant memory is nearly the same in speed, about 2-3 times faster than the global memory case. Texture memory reads are still the fastest.
  3. For even fewer values (200) it is nearly the same, but constant memory becomes the fastest, being nearly 8 times faster than the global memory read; texture memory is only 6 times faster and shared memory is about 4 times faster.

This is, I guess, what should be expected: global memory is slow anyway and becomes even slower when bank conflicts become more likely. For smaller arrays, constant memory becomes faster, because parallel reads from one block become more likely…

Philipp.

But shared memory should be about a hundred times faster than global memory, not only twice as fast…

That depends. The times I wrote are execution times of the whole kernel. The data has to be copied to the shared mem from global memory at the beginning of the kernel.

You might want to post the kernel code and the kernel invocation/benchmarking code, to verify that there is nothing wrong there.

eyal

Look, when I have 3200 points in my lookup table, then with a block size of, let's say, 400, every thread has to copy 8 values from global memory to shared memory. Only then does the random access to shared memory start.

When I read directly from global memory, every thread has to access just 2 values from global memory.

This is the reason why using shared memory is not that much faster.
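As a sketch of what I mean (not my actual kernel; the names, the index array and the summing of the two values are only assumptions for illustration): with 3200 table entries and 400 threads per block, the copy loop below runs 8 times per thread before the two random reads happen.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kSharedTableSize = 3200;  // table size from the post above
constexpr int kBlockSize = 400;         // block size from the post above

// Stage the whole table in shared memory, then do two scattered reads.
// idx holds two table positions per thread (2 * n entries in total).
__global__ void lookupViaShared(const float *table, const int *idx,
                                float *out, int n)
{
    __shared__ float sTable[kSharedTableSize];  // 12800 bytes per block

    // Cooperative copy: 3200 / 400 = 8 global reads per thread.
    for (int k = threadIdx.x; k < kSharedTableSize; k += blockDim.x)
        sTable[k] = table[k];
    __syncthreads();  // the table must be complete before anyone reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sTable[idx[2 * i]] + sTable[idx[2 * i + 1]];
}

// Host driver: fills the table and indices, runs the kernel, checks results.
bool runSharedLookupDemo()
{
    const int n = 10 * kBlockSize;  // 10 blocks worth of threads
    std::vector<float> hTable(kSharedTableSize), hOut(n);
    std::vector<int> hIdx(2 * n);
    for (int i = 0; i < kSharedTableSize; ++i) hTable[i] = float(i);
    for (int i = 0; i < 2 * n; ++i) hIdx[i] = (i * 7919) % kSharedTableSize;

    float *dTable = nullptr, *dOut = nullptr;
    int *dIdx = nullptr;
    bool ok = cudaMalloc(&dTable, kSharedTableSize * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dIdx, 2 * n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dTable, hTable.data(), kSharedTableSize * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dIdx, hIdx.data(), 2 * n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        lookupViaShared<<<n / kBlockSize, kBlockSize>>>(dTable, dIdx, dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i)
            ok = (hOut[i] == hTable[hIdx[2 * i]] + hTable[hIdx[2 * i + 1]]);
    }

    cudaFree(dTable);
    cudaFree(dIdx);
    cudaFree(dOut);
    return ok;
}
```

Call runSharedLookupDemo() from your own main(). Every block repeats the full copy, so the 8 staged reads per thread are paid before the 2 useful reads, which is why the total kernel time does not improve much.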

Constant memory has a cache, so the improved performance at small sizes is probably down to fewer cache misses.

Philipp’s analysis raises the following question:

Say, there are several blocks executing on each multi-processor. Say, all these blocks retrieve data through texture fetches from the same texture. Say, the whole textured data-set fits into the texture cache, but only once, i.e. this data-set is less than, but comparable to 8K in size.

Question: in the above scenario, is texture cache smart enough to hold only one instance of each data-point, even though the same data-point is accessed chaotically from different blocks, that run on the same multi-processor?