I just found this out the hard way. My kernel needs a two-level hash table, so the first thing I came up with was this:
__constant__ short hash_g[1024];
__constant__ short hash_h[8192];
...
return hash_h[(hash_g[b] + a) & 0x1fff];
This works correctly, but it was quite slow. I wondered whether
texture<short, 1, cudaReadModeElementType> hash_g;
texture<short, 1, cudaReadModeElementType> hash_h;
...
cudaBindTexture(0, hash_g, hash_g_gpu, sizeof(hash_g_cpu));
cudaBindTexture(0, hash_h, hash_h_gpu, sizeof(hash_h_cpu));
...
return tex1Dfetch(hash_h, (tex1Dfetch(hash_g, b) + a) & 0x1fff);
would be faster. Well, I'll let the timings speak for themselves:
Constants: method=[ _Z4testPiP11permutation ] gputime=[ 60942.465 ] cputime=[ 60972.000 ] occupancy=[ 1.000 ]
Texture: method=[ _Z4testPiP11permutation ] gputime=[ 29661.119 ] cputime=[ 29920.000 ] occupancy=[ 1.000 ]
The method using a texture bound to global memory is almost exactly twice as fast at the same occupancy! Moral of the story: only use constants if all threads read the same address at the same time (strictly, all threads of a half-warp, since reads of different constant addresses within a half-warp are serialized). Otherwise a texture can be faster, as it was here, even for random access patterns.
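For anyone trying this on a newer toolkit: the texture reference API used above (texture<...> plus cudaBindTexture) was deprecated in later CUDA releases and, as far as I know, removed in CUDA 12. A rough sketch of the same comparison follows, assuming a device of compute capability 3.5 or higher, where __ldg loads through the read-only data cache. The kernel names, the count_mismatches helper, and the table contents are all made up for illustration. It only checks that both variants return the same values and does not reproduce the timings above.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int G_SIZE = 1024;   // first-level table size, as in the post
constexpr int H_SIZE = 8192;   // second-level table size, as in the post

__constant__ short hash_g_const[G_SIZE];
__constant__ short hash_h_const[H_SIZE];

// Variant 1: both tables in constant memory, like the first snippet.
__global__ void lookup_const(const int* a, const int* b, short* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = hash_h_const[(hash_g_const[b[i]] + a[i]) & 0x1fff];
}

// Variant 2: tables in global memory, loaded with __ldg (assumes sm_35+).
__global__ void lookup_ldg(const short* __restrict__ g,
                           const short* __restrict__ h,
                           const int* a, const int* b, short* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&h[(__ldg(&g[b[i]]) + a[i]) & 0x1fff]);
}

// Hypothetical helper (not from the post): fills the tables with placeholder
// data, runs both kernels, and returns how many results differ from a CPU
// reference. Returns -1 if a launch or the synchronize call reports an error.
int count_mismatches(int n)
{
    std::vector<short> g(G_SIZE), h(H_SIZE), out1(n), out2(n);
    std::vector<int> a(n), b(n);
    for (int i = 0; i < G_SIZE; ++i) g[i] = static_cast<short>((i * 37) & 0x1fff);
    for (int i = 0; i < H_SIZE; ++i) h[i] = static_cast<short>(i ^ 0x155);
    for (int i = 0; i < n; ++i) { a[i] = i * 7; b[i] = (i * 13) % G_SIZE; }

    short *d_g, *d_h, *d_out1, *d_out2;
    int *d_a, *d_b;
    cudaMalloc(&d_g, G_SIZE * sizeof(short));
    cudaMalloc(&d_h, H_SIZE * sizeof(short));
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMalloc(&d_out1, n * sizeof(short));
    cudaMalloc(&d_out2, n * sizeof(short));

    // The same table data goes to constant memory and to global memory.
    cudaMemcpyToSymbol(hash_g_const, g.data(), G_SIZE * sizeof(short));
    cudaMemcpyToSymbol(hash_h_const, h.data(), H_SIZE * sizeof(short));
    cudaMemcpy(d_g, g.data(), G_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(d_h, h.data(), H_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, a.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    lookup_const<<<blocks, threads>>>(d_a, d_b, d_out1, n);
    lookup_ldg<<<blocks, threads>>>(d_g, d_h, d_a, d_b, d_out2, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(out1.data(), d_out1, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaMemcpy(out2.data(), d_out2, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(d_g); cudaFree(d_h); cudaFree(d_a);
    cudaFree(d_b); cudaFree(d_out1); cudaFree(d_out2);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        short ref = h[(g[b[i]] + a[i]) & 0x1fff];   // CPU reference lookup
        if (out1[i] != ref || out2[i] != ref) ++bad;
    }
    return bad;
}
```

Calling count_mismatches(1 << 16) from main() should return 0 if both variants agree with the CPU reference. To compare speed you would still have to time the two kernels yourself, for example with cudaEvent timers or the profiler, and the ratio may well differ from the one measured above depending on the hardware and on how scattered the b values are within a warp.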