Which cache for irregular access to array of constants?

I am unsure which cache I should aim to place an array of constants in when threads access random elements of the array (irregular access). After some profiling with nvvp I have identified a bottleneck flagged as “execution dependency” in a snippet of code that looks like:

#define N 128  // Number of entries in array

__constant__ double foo[N]; // <--- KEY LINE ---|

__global__ void bar(const double *random_uniforms)  // So big it needs main memory.
{
  int t_id = threadIdx.x + ... ;
  double x = 0.0;

  for (int i = t_id, j = 0; i < 1024; j++, i += blockDim.x)
  {
    double u = random_uniforms[i]; // u is in [0, 1).
    double v = foo[(int) (u * N)]; // Each thread accesses a random array element!
    x += fma(v, 0.1, 0.2); // Current execution dependency bottleneck.
  }
  // Do something with x
}

int main (...)

The table of constants is very small, so fitting it in cache should not be a problem, but I was surprised to see that the fma is the apparent bottleneck due to the latency of reading from foo. As I understand it, reads through the constant cache are serialized when the threads of a warp do not all read the same element. Given that I expect almost all threads in a warp to require different entries of foo, what is the best way to store it? (The entries of foo are not known at compile time; they are populated by the host based on some interim results.)

Probably read through texture.

Could you elaborate (justify) the benefits/drawbacks?

How would I go about creating a read through texture? (cudaMemcpyToSymbol, cudaFree, etc?)

Reading through texture is demonstrated in a variety of CUDA sample codes.

I suggest you just try it. Any justification I would offer would be hand-waving at best. It’s not difficult to implement. It’s OK if you don’t believe me or don’t want to try it.

There are plenty of discussions of texture cache usage if you care to look. Google is your friend. For instance I googled “cuda texture cache benefits” and I got this:


as the first hit. Not only does it do a better job than I would of describing it, it offers implementation help as well.

Maybe others will chime in with their suggestions or help.

More context would seem helpful. If foo is constant and random_uniforms is precomputed, why is it not possible to push the mapping back to the point where random_uniforms is created right now? Instead of random_uniforms you would then pass in foo_of_random_uniforms and use v = foo_of_random_uniforms[i], i.e. regular access across the threads.
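A minimal sketch of that pushed-back mapping, assuming a separate gather kernel run once after cuRAND fills the uniforms (the names gather_foo, d_foo and the counts are illustrative, not from the posts above):

```cuda
// Hypothetical gather kernel: resolve each uniform to its table entry once,
// so the main compute kernel can read foo_of_random_uniforms[i] with
// regular, coalesced access. d_foo is a plain device copy of the table.
__global__ void gather_foo(const double *random_uniforms,
                           const double *d_foo,
                           double *foo_of_random_uniforms,
                           int n_uniforms, int n_entries)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_uniforms)
        foo_of_random_uniforms[i] = d_foo[(int)(random_uniforms[i] * n_entries)];
}
```

The irregular access still happens, but only once per uniform, in a memory-bound kernel where its latency can be overlapped; the latency-sensitive kernel then sees only regular reads.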

Textures can help if the access pattern is irregular but still has reasonable locality. Whether it helps here I would not be able to predict; the profiler will tell you. You can bind textures to existing allocations. The CUDA docs and example apps provide sufficient coverage of how to use textures.
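For reference, a sketch of binding a texture object to an existing device allocation (so foo would move out of __constant__ memory into a cudaMalloc'd array filled with a plain cudaMemcpy rather than cudaMemcpyToSymbol). One wrinkle: textures cannot fetch double directly, so a common workaround is to declare the texture as int2 and reassemble the double with __hiloint2double(). All names here are illustrative; treat this as a starting point, not a drop-in replacement:

```cuda
#include <cuda_runtime.h>

// Device-side fetch: read the table entry as int2 and rebuild the double.
__device__ double fetch_foo(cudaTextureObject_t fooTex, int idx)
{
    int2 raw = tex1Dfetch<int2>(fooTex, idx);  // goes through the texture cache
    return __hiloint2double(raw.y, raw.x);     // (hi word, lo word) -> double
}

// Host-side setup, binding the texture object to an existing allocation:
void make_foo_texture(double *d_foo, int n, cudaTextureObject_t *fooTex)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_foo;                        // existing allocation
    resDesc.res.linear.desc  = cudaCreateChannelDesc<int2>(); // 64 bits per element
    resDesc.res.linear.sizeInBytes = n * sizeof(double);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaCreateTextureObject(fooTex, &resDesc, &texDesc, NULL);
    // ... launch kernels that take fooTex as an argument ...
    // cudaDestroyTextureObject(*fooTex); cudaFree(d_foo); when done.
}
```

The kernel would then take a cudaTextureObject_t parameter and call fetch_foo() in place of the foo[...] lookup.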

To try to give a bit more context: this is for a Monte Carlo simulation. I use cuRAND to populate main memory with random uniforms. I then need to see which entry in the foo array each of these most closely maps to (this might later be changed to a random integer generator to avoid the conversion, but I expect I need it for the time being for an MLMC application). A second reason is that I have several blocks all processing the same uniforms. A slightly truer-to-life code snippet would read:

double u = random_uniforms[i]; // u is in [0, 1).
u = baz(u, blockIdx.x);  // Randomise/scramble the uniforms 
double v = foo[(int) (u * N)]; // Each thread accesses a random array element!

where each block performs a unique randomisation baz() of the same underlying uniforms (think Quasi Monte Carlo).