I have a large application I’m porting from x86 Linux to CUDA. It has 32 device functions totaling about 1000 SLOC. The performance is extremely poor.
I’ve looked at the .ptx and found a few lookup tables that I could move to constant memory, among other things.
I’ve tried a lot of modifications that I thought should improve performance, but nothing seems to have much effect. I’m starting to wonder about the instruction cache, but I can’t find anything on its size, algorithm, cache line size, etc.
I’m using CUDA 4.0 with a compute capability (CC) 2.1 card.
Can anybody help with instruction cache analysis and information?