Hello!
I was delighted to see that the bw_test.cu program gets near-optimal bandwidth from the G80 GPU with a very simple kernel, i.e.:
// const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
// g_odata[idx] = tex1Dfetch(tex_float, idx); → 73GiB/s !
Now I want to do the same with my slightly less simple kernel. All texture references are simply bound to cudaMalloc’ed data blocks of approx. 18M floats each:
#define THREADS_PER_BLOCK 384
__global__ void cuda_pointop(float *a, float *b, float *c, float *d, int n)
{
    const unsigned int idx = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; // points
    float a_in = tex1Dfetch(a_texref, idx);
    float b_in = tex1Dfetch(b_texref, idx);
    float c_in = tex1Dfetch(c_texref, idx);
    float d_in = tex1Dfetch(d_texref, idx);
    if (idx < n)
    {
        float a_out = a_in - b_in + c_in * d_in;
        if (a_out < 0.0f) a_out = 0.0f; // clamp negative results to zero
        a[idx] = a_out;
    }
}
With this kernel I only get about half of that (37 GiB/s).
Why?
Can anyone help me get more bandwidth?
Thanks!
Phil.