Instruction Cache


I have a large application I’m porting from x86 Linux to CUDA. It has a total of 32 device functions (about 1000 SLOC). Performance is extremely poor.
I’ve looked at the .ptx and found a few lookup tables that I could move to constant memory, etc.

I’ve tried a lot of modifications that I think should improve performance, but nothing seems to have much effect. I’m starting to wonder about the instruction cache, but I can’t find anything on its size, replacement policy, cache line size, etc.

I’m using CUDA 4.0 with a card with CC 2.1.

Can anybody help with instruction cache analysis and information?


Constant memory is not a good fit for lookup tables, as accesses to different table elements by threads of the same warp get serialized. Textures would be a better fit (or just global memory, if the texture cache is already under a lot of pressure).
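To make the contrast concrete, here is a hedged sketch of the two lookup-table placements. The table size, kernel names, and indices are illustrative, not from the original post; the texture code uses the texture-reference API that was current in the CUDA 4.x era.

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 256

// Constant memory: the constant cache broadcasts one value per cycle, so it
// is fast ONLY when all threads of a warp read the SAME element. Divergent
// indices within a warp serialize into one access per distinct address.
__constant__ float c_table[TABLE_SIZE];

__global__ void lookup_const(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_table[idx[i]];   // serialized if idx[] diverges in a warp
}

// Texture path: the texture cache handles divergent indices without
// serialization, which suits data-dependent table lookups.
texture<float, 1, cudaReadModeElementType> t_table;

__global__ void lookup_tex(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(t_table, idx[i]);   // served by the texture cache
}
```

On the host side you would copy the table with cudaMemcpyToSymbol() for the constant version, and bind a device buffer with cudaBindTexture() for the texture version. If the indices are known to be uniform across each warp, the constant version is the faster of the two.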

I’m not aware that anyone has measured (and published) the instruction cache size on Fermi-class GPUs yet. According to “Demystifying GPU Microarchitecture through Microbenchmarking,” the L1 instruction cache size on GT200 is 4 KB. I wouldn’t expect the cache to be smaller on GF100/GF110. If this really is an issue, you could take their code and rerun it on your GPU to find out.
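The basic idea behind that kind of microbenchmark can be sketched as follows: instantiate kernels whose code footprint grows in known steps and time them; once the footprint exceeds an instruction cache level, issue throughput drops. Everything below (macro names, the arithmetic body, the template parameter) is illustrative, not the paper’s actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each OP is one dependent fused multiply-add; chaining them stretches the
// kernel's static code size without changing its data footprint.
#define OP4  t = t * t + t; t = t * t + t; t = t * t + t; t = t * t + t;
#define OP16 OP4 OP4 OP4 OP4
#define OP64 OP16 OP16 OP16 OP16

// REPS controls the static code size: the inner loop is fully unrolled, so
// the kernel body contains roughly REPS * 64 FMA instructions.
template <int REPS>
__global__ void icache_probe(float *out, int iters)
{
    float t = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i) {
        #pragma unroll
        for (int r = 0; r < REPS; ++r) { OP64 }
    }
    out[threadIdx.x] = t;   // keep the compiler from eliminating the work
}
```

Timing `icache_probe<1>`, `icache_probe<2>`, `icache_probe<4>`, … with cudaEvent timers and plotting time per instruction against code size should show a step where each cache level is exceeded; disassembling the cubin (cuobjdump) tells you the actual code size per instantiation.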