Hello!

I was delighted that the bw_test.cu program could get near-optimal bandwidth from the G80 GPU with a very simple kernel:

```
const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
g_odata[idx] = tex1Dfetch(tex_float, idx);   // -> 73 GiB/s!
```

Now I want to do the same with my slightly less simple kernel. All texture references are simply bound to cudaMalloc'ed data blocks of approximately 18M floats each:

```
#define THREADS_PER_BLOCK 384

__global__ void cuda_pointop(float *a, float *b, float *c, float *d, int n)
{
    const unsigned int idx = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; // points

    float a_in = tex1Dfetch(a_texref, idx);
    float b_in = tex1Dfetch(b_texref, idx);
    float c_in = tex1Dfetch(c_texref, idx);
    float d_in = tex1Dfetch(d_texref, idx);

    if (idx < n)
    {
        float a_out = a_in - b_in + c_in * d_in;
        if (a_out < 0.0f) a_out = 0.0f;
        a[idx] = a_out;
    }
}
```
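In case it matters, the host-side setup looks roughly like this (a sketch, not my exact code; `d_a` through `d_d` are placeholder device pointers, the b/c/d arrays are handled the same way as a, and error checking is omitted):

```
texture<float, 1, cudaReadModeElementType> a_texref;  // likewise b/c/d_texref

// Allocate one device buffer per array and bind a texture reference to it.
const int n = 18 * 1000 * 1000;                 // ~18M floats per array
float *d_a;                                     // likewise d_b, d_c, d_d
cudaMalloc((void **)&d_a, n * sizeof(float));
cudaBindTexture(0, a_texref, d_a, n * sizeof(float));  // repeat for b/c/d

// One thread per point.
const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
cuda_pointop<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, d_d, n);
```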

But I only get about half of that: 37 GiB/s.

Why?

Can anyone help me get more bandwidth?

Thanks!

Phil.