Hello all.
I am developing a CUDA application where I have to evaluate a bunch of coefficients for each thread (it's a kind of ray tracer). To do this, I have a set of control points stored in a texture that has to be read in a deterministic manner by every thread.
I had the impression that this was a prime candidate for shared memory: allocate a shared block, let each thread do just one texfetch, and then have all threads read from the shared block. My timings, however, reveal that it is faster to let every thread do many texture fetches on its own. These texture fetches are deterministic and happen at the same time for every thread, so the access pattern is probably very cache-friendly. Or is there some point I am missing?
Example code (simplified to get the point across, and only valid for cubic surfaces):
__device__ void
calcCoeffs( float *res, const int d, const int numtiles )  // numtiles unused in this simplified listing
{
    // getuv() returns this thread's (u,v) parameter pair (helper omitted here).
    const float u = getuv().x;
    const float v = getuv().y;

    // Bernstein basis values in u and v (maxd is a compile-time bound on d+1,
    // defined elsewhere in the full code).
    float bu[maxd];
    float bv[maxd];
    evalBernsteinBasis( u, d, &bu[0] );
    evalBernsteinBasis( v, d, &bv[0] );

    // *** Set up a shared block. Let each thread load one element of the texture.
    // *** Reading outside the texture will result in a 0 being stored.
    __shared__ float4 TT[BLOCK_SIZE][BLOCK_SIZE];

    // Thread index
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    TT[tx][ty] = texfetch( T, tx, ty );
    __syncthreads();

    // Weighted sum of control points. The float4 arithmetic below assumes the
    // usual vector operator overloads (e.g. from cutil_math.h).
    float4 sum = make_float4( 0.0f, 0.0f, 0.0f, 0.0f );
    for ( int i = 0; i < d+1; ++i ) {
        for ( int j = 0; j < d+1; ++j ) {
            // *** Reading the texture inside the loop this way is faster. Why?
            // float4 t = texfetch( T, i, j );
            // sum = sum + bu[i]*bv[j]*t;
            sum = sum + bu[i]*bv[j]*TT[i][j];
        }
    }

    // k is the per-thread output index (set up in the full code, omitted here).
    res[k*4+0] = sum.x;
    res[k*4+1] = sum.y;
    res[k*4+2] = sum.z;
    res[k*4+3] = sum.w;
}
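For context, the basis evaluation itself is cheap. Something along these lines is what the listing assumes for evalBernsteinBasis (a minimal sketch; the real helper is equivalent but was simplified away above):

__device__ void
evalBernsteinBasis( const float t, const int d, float *b )
{
    // Fills b[0..d] with B_i^d(t) = C(d,i) * t^i * (1-t)^(d-i)
    // using the triangular (de Casteljau-style) scheme.
    const float s = 1.0f - t;
    b[0] = 1.0f;
    for ( int j = 1; j <= d; ++j ) {
        float saved = 0.0f;
        for ( int i = 0; i < j; ++i ) {
            const float tmp = b[i];
            b[i] = saved + s*tmp;
            saved = t*tmp;
        }
        b[j] = saved;
    }
}

So the per-thread arithmetic before the double loop is tiny; the (d+1)*(d+1) control-point reads are what I am trying to optimise, which is why I expected the shared-memory version to win.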