I just found this out the hard way. In my kernel I need a two-level hash table, so the first thing I came up with was this:
__constant__ short hash_g[...];
__constant__ short hash_h[...];
...
return hash_h[(hash_g[b] + a) & 0x1fff];
This worked correctly, but it was quite slow. I wondered whether
texture<short, 1, cudaReadModeElementType> hash_g;
texture<short, 1, cudaReadModeElementType> hash_h;
...
cudaBindTexture(0, hash_g, hash_g_gpu, sizeof(hash_g_cpu));
cudaBindTexture(0, hash_h, hash_h_gpu, sizeof(hash_h_cpu));
...
return tex1Dfetch(hash_h, (tex1Dfetch(hash_g, b) + a) & 0x1fff);
would be faster. Well, I'll let the timings speak for themselves:
Constants: method=[ _Z4testPiP11permutation ] gputime=[ 60942.465 ] cputime=[ 60972.000 ] occupancy=[ 1.000 ]
Texture:   method=[ _Z4testPiP11permutation ] gputime=[ 29661.119 ] cputime=[ 29920.000 ] occupancy=[ 1.000 ]
The method using a texture bound to global memory is almost exactly twice as fast (about 2.05x by gputime) at the same occupancy! Moral of the story: only use constants if all threads of a warp (a half-warp on older hardware) read the same address at the same time; reads to different constant addresses are serialized. Otherwise a texture can be faster, as it was here, even for random access patterns.
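For anyone who wants to try this on a current toolkit: the texture reference calls used above (texture<...>, cudaBindTexture) were removed in CUDA 12, so here is a rough, self-contained sketch of the same two-level lookup done both ways, using the texture object API for the texture path. This is not the original kernel. The table sizes (G_SIZE, H_SIZE), the fill values, the access pattern, and the names c_hash_g, c_hash_h, lookup_kernel, make_linear_tex and count_mismatches are all assumptions made for illustration. It only checks that both paths return the same values as a CPU reference; it does not measure speed.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Assumed sizes: the post does not state them. H_SIZE matches the 0x1fff mask.
#define G_SIZE 256
#define H_SIZE 0x2000

__constant__ short c_hash_g[G_SIZE];
__constant__ short c_hash_h[H_SIZE];

// Each thread does the two-level lookup twice: once through constant memory,
// once through texture objects bound to linear global memory.
__global__ void lookup_kernel(cudaTextureObject_t tex_g, cudaTextureObject_t tex_h,
                              int n, short *out_const, short *out_tex)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int b = (i * 37) % G_SIZE;   // scattered indices: threads in a warp
    int a = i * 101;             // read different addresses
    out_const[i] = c_hash_h[(c_hash_g[b] + a) & 0x1fff];
    out_tex[i] = tex1Dfetch<short>(tex_h, (tex1Dfetch<short>(tex_g, b) + a) & 0x1fff);
}

// Hypothetical helper: wrap a linear device buffer of shorts in a texture object.
static cudaTextureObject_t make_linear_tex(short *d_ptr, size_t count)
{
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_ptr;
    res.res.linear.desc = cudaCreateChannelDesc<short>();
    res.res.linear.sizeInBytes = count * sizeof(short);

    cudaTextureDesc desc;
    memset(&desc, 0, sizeof(desc));
    desc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &desc, NULL);
    return tex;
}

// Returns the number of results that differ from the CPU reference,
// or -1 if the kernel could not be run.
int count_mismatches()
{
    const int n = 1 << 16;
    std::vector<short> g(G_SIZE), h(H_SIZE);
    for (int i = 0; i < G_SIZE; ++i) g[i] = (short)((i * 7919) & 0x1fff);
    for (int i = 0; i < H_SIZE; ++i) h[i] = (short)(i ^ 0x155);

    // Constant-memory copies of both tables.
    cudaMemcpyToSymbol(c_hash_g, g.data(), G_SIZE * sizeof(short));
    cudaMemcpyToSymbol(c_hash_h, h.data(), H_SIZE * sizeof(short));

    // Global-memory copies, read through texture objects.
    short *d_g = NULL, *d_h = NULL, *d_const = NULL, *d_tex = NULL;
    cudaMalloc((void **)&d_g, G_SIZE * sizeof(short));
    cudaMalloc((void **)&d_h, H_SIZE * sizeof(short));
    cudaMalloc((void **)&d_const, n * sizeof(short));
    cudaMalloc((void **)&d_tex, n * sizeof(short));
    cudaMemcpy(d_g, g.data(), G_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(d_h, h.data(), H_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaTextureObject_t tex_g = make_linear_tex(d_g, G_SIZE);
    cudaTextureObject_t tex_h = make_linear_tex(d_h, H_SIZE);

    lookup_kernel<<<(n + 255) / 256, 256>>>(tex_g, tex_h, n, d_const, d_tex);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<short> r_const(n), r_tex(n);
    cudaMemcpy(r_const.data(), d_const, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaMemcpy(r_tex.data(), d_tex, n * sizeof(short), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        short expected = h[(g[(i * 37) % G_SIZE] + i * 101) & 0x1fff];
        if (r_const[i] != expected || r_tex[i] != expected) ++mismatches;
    }

    cudaDestroyTextureObject(tex_g);
    cudaDestroyTextureObject(tex_h);
    cudaFree(d_g); cudaFree(d_h); cudaFree(d_const); cudaFree(d_tex);
    return ok ? mismatches : -1;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

To compare speed on your own hardware, split the two lookups into separate kernels and time each one with the profiler or with cudaEvent timers. The numbers above come from one specific kernel and card, so the ratio may well differ elsewhere.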