multiple texture fetch bandwidth

Hello!

I was delighted to see the program get near-optimal bandwidth from the G80 GPU with a very simple kernel, i.e.:

// const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
// g_odata[idx] = tex1Dfetch(tex_float, idx); -> 73GiB/s !

Now I want to do the same with my slightly less simple kernel. All texture references are simply bound to cudaMalloc’ed data blocks of approx. 18M floats each:


__global__ void cuda_pointop(float *a, float *b, float *c, float *d, int n)
{
    const unsigned int idx = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; // one thread per point

    // four separate single-float fetches, one per bound texture reference
    float a_in = tex1Dfetch(a_texref, idx);
    float b_in = tex1Dfetch(b_texref, idx);
    float c_in = tex1Dfetch(c_texref, idx);
    float d_in = tex1Dfetch(d_texref, idx);

    if (idx < n)
    {
        float a_out = a_in - b_in + c_in * d_in;

        if (a_out < 0.0f) a_out = 0.0f; // clamp negative results to zero

        a[idx] = a_out;
    }
}


I only get about half (37GiB/s).

Why? Can anyone help me get more bandwidth?



Pack your data into a float4 texture and read that; you should be able to attain optimal bandwidth again.

That said, I have no good explanation for why four float texture reads are slower (I have observed this behavior, too). It may be because the hardware has a limited number of texture addressing units. If that is true, then G9x hardware should perform better in situations like this. I’ll check out of my own curiosity as soon as my G9x box is up and running.
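
For reference, here is a minimal sketch of what the float4 packing could look like. It assumes the four inputs can be stored interleaved, so that each texel holds (a, b, c, d) for one point, and it uses the same legacy texture reference API as the kernel above (that API was removed in CUDA 12, where a texture object would be needed instead). The names abcd_texref and cuda_pointop_packed are made up for illustration, and the result goes to a separate output buffer rather than back into a. I have not measured this, so treat it as a starting point rather than a guaranteed fix.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Legacy texture reference (hypothetical name): one float4 texel per point,
// laid out as x = a, y = b, z = c, w = d.
texture<float4, 1, cudaReadModeElementType> abcd_texref;

// One float4 fetch per point replaces the four separate float fetches.
__global__ void cuda_pointop_packed(float *out, int n)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < n)
    {
        float4 in = tex1Dfetch(abcd_texref, idx);
        float r = in.x - in.y + in.z * in.w;   // a - b + c * d
        out[idx] = (r < 0.0f) ? 0.0f : r;      // clamp negative results to zero
    }
}

int main()
{
    const int n = 1 << 20;      // small test size, not the 18M from the post
    const int threads = 256;    // assumed block size

    // Host data already interleaved; a - b + c * d is -1 (clamped to 0) or 2.
    std::vector<float4> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4(1.0f, 2.0f, 3.0f, (float)(i % 2));

    float4 *d_in = 0;
    float *d_out = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float4));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    // Bind the packed linear buffer to the float4 texture reference.
    cudaBindTexture(0, abcd_texref, d_in, n * sizeof(float4));

    cuda_pointop_packed<<<(n + threads - 1) / threads, threads>>>(d_out, n);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f, out[1] = %f\n", h_out[0], h_out[1]);

    cudaUnbindTexture(abcd_texref);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the four arrays are produced separately, interleaving them costs an extra pass over the data, so this only pays off when the data can be generated in packed form or the packed buffer is read many times.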