Can someone give me some more information about the cache access characteristics of compute capability 2.0 devices? The CUDA Programming Guide mentions that the L1 cache shares on-chip storage with shared memory and is fetched in 128-byte lines, and that the L2 cache is global across all multiprocessors and fetched in 32-byte lines. What I'm missing is information about:
- the size of the L2 cache
- how many clock cycles a read from L1 or L2 takes on a cache hit
- whether there can be bank conflicts when reading from L1. After all, it is physically the same storage as shared memory. How could I avoid conflicts when I have no control over where data is placed in the cache?
- how the volatile qualifier interacts with the L1 and L2 caches. Are the caches bypassed entirely when accessing volatile global/host memory, or are they snooped on changes?
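For reference, this is the kind of code I'm wondering about. It is only a sketch (the `pollFlag` kernel name is made up), showing the two mechanisms my questions concern: the per-kernel L1/shared-memory split requested through `cudaFuncSetCacheConfig`, and a `volatile` global pointer. My understanding is that `volatile` forces a fresh global load on every access instead of reusing a register or an L1 line, but whether that load still hits L2 is exactly what I'm unsure about.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: spins on a flag through a volatile pointer.
// The volatile qualifier should make the compiler re-read global memory
// on every iteration rather than keeping the value in a register --
// how this interacts with L1/L2 is the open question.
__global__ void pollFlag(volatile int *flag, int *out)
{
    while (*flag == 0)
        ;           // busy-wait until another thread or the host sets the flag
    *out = *flag;
}

int main()
{
    // On 2.0 devices the 64 KB per-multiprocessor array is split between
    // L1 and shared memory. This call requests the 48 KB L1 / 16 KB shared
    // configuration for this kernel (the default is 16 KB L1 / 48 KB shared).
    cudaFuncSetCacheConfig(pollFlag, cudaFuncCachePreferL1);
    return 0;
}
```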