Appendix G of the CUDA 3.1 C programming guide says
My largest kernel uses 55878 bytes of constant memory
in one large and two small (32 int each) arrays.
It runs very slowly.
I can make performance worse by moving where the arrays are declared.
I am using a mixture of short int and unsigned int.
I am not sure of the significance of the 8KB cache.
But I am beginning to suspect that (despite lots of effort
with shared memory) the kernel is held up by random access
to off-chip memory for “constant” data as the 8KB cache is overwhelmed.
On the other hand, perhaps the GTX 295 does not like short int constants.
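
A minimal sketch of how the cache suspicion could be checked, not the actual kernel.
It assumes a made-up table of 27000 short ints (about 54 KB, nearly seven times the 8KB cache)
read at random per-thread indices, and times the same lookup from constant memory and from
ordinary global memory. All names (c_table, lookup_constant, lookup_global, TABLE_LEN, N)
are hypothetical, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TABLE_LEN 27000   // hypothetical: ~54 KB of short int, far more than 8 KB
#define N (1 << 20)       // hypothetical number of lookups

__constant__ short c_table[TABLE_LEN];   // hypothetical table in constant memory

// Every thread reads a different, scattered element of the constant table.
__global__ void lookup_constant(const unsigned int *idx, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = c_table[idx[i]];
}

// Same lookup with the table held in ordinary global memory.
__global__ void lookup_global(const short *table, const unsigned int *idx,
                              int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[idx[i]];
}

int main()
{
    static short h_table[TABLE_LEN];
    unsigned int *h_idx = (unsigned int *)malloc(N * sizeof(unsigned int));
    for (int i = 0; i < TABLE_LEN; i++) h_table[i] = (short)(i % 1000);
    // Random indices stand in for the "random access" pattern described above.
    for (int i = 0; i < N; i++) h_idx[i] = (unsigned int)rand() % TABLE_LEN;

    short *d_table; unsigned int *d_idx; int *d_out;
    cudaMalloc((void **)&d_table, sizeof(h_table));
    cudaMalloc((void **)&d_idx, N * sizeof(unsigned int));
    cudaMalloc((void **)&d_out, N * sizeof(int));
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx, N * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));

    int threads = 256, blocks = (N + threads - 1) / threads;

    // Warm-up launches so one-off start-up cost is not counted in the timings.
    lookup_constant<<<blocks, threads>>>(d_idx, d_out, N);
    lookup_global<<<blocks, threads>>>(d_table, d_idx, d_out, N);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    lookup_constant<<<blocks, threads>>>(d_idx, d_out, N);
    cudaEventRecord(t1, 0);
    lookup_global<<<blocks, threads>>>(d_table, d_idx, d_out, N);
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float ms_const = 0.0f, ms_global = 0.0f;
    cudaEventElapsedTime(&ms_const, t0, t1);
    cudaEventElapsedTime(&ms_global, t1, t2);
    printf("constant: %.3f ms   global: %.3f ms\n", ms_const, ms_global);

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(d_table); cudaFree(d_idx); cudaFree(d_out);
    free(h_idx);
    return 0;
}
```

Setting every entry of h_idx to the same value gives the best case for constant memory, where
all threads read one address. If the constant version is only slow with random indices, that
would point at the access pattern rather than at short int. Changing the table type from short
to int (with a smaller TABLE_LEN so it still fits in 64 KB) would be a rough check of the
short int idea.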
As always, any help, comments or hints would be most welcome.