For hash tables, use a 1D texture bound to global memory instead of constants

I just found this out the hard way. In my kernel I need a two-level hash table, so the first thing I came up with was this:

__constant__ short hash_g[1024];

__constant__ short hash_h[8192];


return hash_h[(hash_g[b] + a) & 0x1fff];

This works correctly, but it was quite slow. I wondered whether

texture<short, 1, cudaReadModeElementType> hash_g;

texture<short, 1, cudaReadModeElementType> hash_h;


cudaBindTexture(0, hash_g, hash_g_gpu, sizeof(hash_g_cpu));

cudaBindTexture(0, hash_h, hash_h_gpu, sizeof(hash_h_cpu));


return tex1Dfetch(hash_h, (tex1Dfetch(hash_g, b) + a) & 0x1fff);

would be faster. Well, I’ll let the timings speak for themselves:

Constants: method=[ _Z4testPiP11permutation ] gputime=[ 60942.465 ] cputime=[ 60972.000 ] occupancy=[ 1.000 ]

Texture: method=[ _Z4testPiP11permutation ] gputime=[ 29661.119 ] cputime=[ 29920.000 ] occupancy=[ 1.000 ]

The method using a texture bound to global memory is almost exactly twice as fast at the same occupancy! Moral of the story: only use constants if every thread in the block reads the same address at the same time. Otherwise a texture is faster, even for random access patterns.
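For anyone who wants to try the comparison on a current toolkit, here is a minimal sketch that puts both lookup variants side by side and checks them against a CPU reference. Note the swap: the `texture<...>` reference API and `cudaBindTexture` shown above were removed in CUDA 12, so this sketch uses texture objects (`cudaCreateTextureObject`) instead. The names `lookup_const`, `lookup_tex`, `make_tex` and `count_mismatches`, as well as the table contents and the input arrays `a` and `b`, are made up for illustration and are not the code from the original kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define G_SIZE 1024
#define H_SIZE 8192

// Variant 1: both tables in constant memory (18 KB total, under the 64 KB limit).
__constant__ short c_hash_g[G_SIZE];
__constant__ short c_hash_h[H_SIZE];

__global__ void lookup_const(const int *a, const int *b, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_hash_h[(c_hash_g[b[i]] + a[i]) & 0x1fff];
}

// Variant 2: the same tables in global memory, read through 1D textures.
__global__ void lookup_tex(cudaTextureObject_t g, cudaTextureObject_t h,
                           const int *a, const int *b, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<short>(h, (tex1Dfetch<short>(g, b[i]) + a[i]) & 0x1fff);
}

// Wrap a linear device buffer of shorts in a texture object (hypothetical helper).
static cudaTextureObject_t make_tex(short *dev_ptr, size_t count)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dev_ptr;
    res.res.linear.desc = cudaCreateChannelDesc<short>();
    res.res.linear.sizeInBytes = count * sizeof(short);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // return raw shorts, no normalization

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;
}

// Runs both variants on made-up data; returns how many results differ
// from the CPU reference (0 means both variants agree with it).
int count_mismatches()
{
    const int n = 1 << 16;
    std::vector<short> g(G_SIZE), h(H_SIZE), out_c(n), out_t(n);
    std::vector<int> a(n), b(n);
    for (int i = 0; i < G_SIZE; ++i) g[i] = (short)((i * 37) & 0x1fff);
    for (int i = 0; i < H_SIZE; ++i) h[i] = (short)(i ^ 0x155);
    for (int i = 0; i < n; ++i) { a[i] = (i * 7919) & 0x1fff; b[i] = (i * 31) % G_SIZE; }

    short *d_g, *d_h, *d_out;
    int *d_a, *d_b;
    cudaMalloc(&d_g, G_SIZE * sizeof(short));
    cudaMalloc(&d_h, H_SIZE * sizeof(short));
    cudaMalloc(&d_out, n * sizeof(short));
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMemcpy(d_g, g.data(), G_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(d_h, h.data(), H_SIZE * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, a.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_hash_g, g.data(), G_SIZE * sizeof(short));
    cudaMemcpyToSymbol(c_hash_h, h.data(), H_SIZE * sizeof(short));
    cudaTextureObject_t tg = make_tex(d_g, G_SIZE), th = make_tex(d_h, H_SIZE);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    lookup_const<<<blocks, threads>>>(d_a, d_b, d_out, n);
    cudaMemcpy(out_c.data(), d_out, n * sizeof(short), cudaMemcpyDeviceToHost);
    lookup_tex<<<blocks, threads>>>(tg, th, d_a, d_b, d_out, n);
    cudaMemcpy(out_t.data(), d_out, n * sizeof(short), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        short ref = h[(g[b[i]] + a[i]) & 0x1fff];
        if (out_c[i] != ref) ++bad;
        if (out_t[i] != ref) ++bad;
    }

    cudaDestroyTextureObject(tg);
    cudaDestroyTextureObject(th);
    cudaFree(d_g); cudaFree(d_h); cudaFree(d_out); cudaFree(d_a); cudaFree(d_b);
    return bad;
}
```

Call `count_mismatches()` from a `main()` to confirm the two variants return the same values, then time the two kernels separately (with the profiler or `cudaEvent` timers) to see how they compare on your hardware. The sketch only checks correctness; how large the gap is will depend on the GPU and on how scattered the indices in `b` are.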

Yes, textures are generally faster than constants when each thread is accessing a different address.

This is because constant memory was mainly designed for storing stuff like light positions and view matrices in vertex and pixel shaders, which are typically accessed by all threads at the same address.

Thanks, that certainly makes sense. The reason I didn’t use a texture in the first place was because a long time ago, dependent texture lookups in shaders were a bad thing. But this doesn’t seem to apply to G80 anymore.

Many thanks for this thread!
I’ve just replaced my hash table lookups with tex1Dfetch() and got a few percent overall performance improvement!