How big are the texture caches (L1 and L2) and the constant cache, and what are their latencies and throughputs?
What about spatial caching with the constant cache? Suppose every thread in a block runs the same loop, and on each iteration all threads access the same element in memory, stepping through the entire block of memory in order and repeating the whole pass several times. Is it better to do this in constant, texture, or shared memory?
I.e., a very simplified version of the idea:
for (k = 0; k < K; k++)
    for (y = 0; y < Y; y++)
        for (x = 0; x < X; x++)
            out += mem[y]*…;
What would be the best memory type for mem? And if it’s shared memory, what would be second best, since I may be short on shared memory for this implementation?
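To make the access pattern concrete, here is a minimal sketch of the constant-memory variant of the loop. Everything not in the snippet above is an assumption made for illustration: the names c_mem, MEM_LEN, accumulate and other are hypothetical, the sizes are arbitrary, and other[tid] is only a placeholder for the factor I elided in the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MEM_LEN 256   // hypothetical length of mem (plays the role of Y)

// Hypothetical constant-memory copy of mem.
// Total constant memory is 64 KB per device, so mem has to fit in that.
__constant__ float c_mem[MEM_LEN];

// Each thread runs the same K/Y/X loop nest as in the question.
// other[tid] is a placeholder for the elided factor.
__global__ void accumulate(const float* other, float* out, int K, int X)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        for (int y = 0; y < MEM_LEN; ++y)
            for (int x = 0; x < X; ++x)
                // All threads of a warp read the same address c_mem[y],
                // which is the access pattern constant memory broadcasts.
                acc += c_mem[y] * other[tid];
    out[tid] = acc;
}

int main()
{
    const int threads = 64, K = 2, X = 3;   // arbitrary test sizes
    float h_mem[MEM_LEN], h_other[threads], h_out[threads];
    for (int i = 0; i < MEM_LEN; ++i) h_mem[i] = 1.0f;
    for (int i = 0; i < threads; ++i) h_other[i] = 1.0f;

    // Fill the constant-memory array from the host.
    cudaMemcpyToSymbol(c_mem, h_mem, sizeof(h_mem));

    float *d_other, *d_out;
    cudaMalloc(&d_other, sizeof(h_other));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_other, h_other, sizeof(h_other), cudaMemcpyHostToDevice);

    accumulate<<<1, threads>>>(d_other, d_out, K, X);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // With all inputs set to 1, each thread should hold K * MEM_LEN * X.
    printf("out[0] = %f\n", h_out[0]);

    cudaFree(d_other);
    cudaFree(d_out);
    return 0;
}
```

The shared-memory variant I am comparing against would instead have each block copy mem into a __shared__ array once and read from that inside the same loop nest, which is where my shared-memory budget becomes the problem.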