A question about a strange performance degradation when using textures

Hi all,

I'm writing a CUDA algorithm that has to process data fetched from an array (‘D’) of floats in device memory.

For several reasons the data is fetched in a random order (‘index’) from memory, so I bind a texture (‘tex_D_float’) to that array to improve performance. This is the source code:

..
texture<float,1> tex_D_float;
..
cudaBindTexture(NULL, tex_D_float, cu_vector_D, VectorDim*sizeof(float));
..
__global__ void kernel1(.....)    // launched as kernel1<<<....>>>(.....)
{
     float        tempFloatD;
     unsigned int index;
     ....
     [calculate "index" value]
     tempFloatD = tex1Dfetch(tex_D_float, index);
     ...
     [process data]
     ....
     [output data]
}

[number of threads=VectorDim]

With this code, each thread fetches only one float at a time (tempFloatD).

In a second stage, I noticed that I could improve the processing algorithm by fetching and processing two consecutive (in memory) floats (float2 type) at a time.

Without a texture, fetching a float2 per thread is clearly faster than fetching and processing a single float per thread.
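For reference, a minimal sketch of what the non-texture float2 fetch could look like. The kernel name sumPairsNoTexture, the pairIndex array (standing in for the index calculation) and the x + y "processing" are placeholders invented for illustration, not the real algorithm. Note that index counts float2 elements here, not floats.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread loads one float2 (two consecutive floats) directly from global memory.
// pairIndex is a hypothetical stand-in for the "[calculate index value]" step.
__global__ void sumPairsNoTexture(const float* D, const unsigned int* pairIndex,
                                  float* out, unsigned int numPairs)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numPairs) return;

    unsigned int index = pairIndex[tid];   // counts float2 elements, not floats
    // View the float array as float2. The 8-byte alignment this needs is
    // satisfied by pointers returned from cudaMalloc.
    float2 v = reinterpret_cast<const float2*>(D)[index];
    out[tid] = v.x + v.y;                  // placeholder for "[process data]"
}

// Runs the kernel and returns the number of mismatches against a CPU
// reference, or -1 if a CUDA call failed.
int runNoTextureCheck(unsigned int vectorDim)
{
    const unsigned int numPairs = vectorDim / 2;
    std::vector<float> hD(vectorDim);
    std::vector<unsigned int> hIdx(numPairs);
    std::vector<float> hOut(numPairs);
    for (unsigned int i = 0; i < vectorDim; ++i) hD[i] = static_cast<float>(i + 1);
    // Deterministic scattered order (assumption: any in-range index will do).
    for (unsigned int i = 0; i < numPairs; ++i) hIdx[i] = (i * 7u + 3u) % numPairs;

    float* dD = NULL;
    unsigned int* dIdx = NULL;
    float* dOut = NULL;
    bool ok = cudaMalloc(&dD, vectorDim * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dIdx, numPairs * sizeof(unsigned int)) == cudaSuccess
           && cudaMalloc(&dOut, numPairs * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dD, &hD[0], vectorDim * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dIdx, &hIdx[0], numPairs * sizeof(unsigned int), cudaMemcpyHostToDevice);

        const unsigned int threads = 128;
        const unsigned int blocks = (numPairs + threads - 1) / threads;
        sumPairsNoTexture<<<blocks, threads>>>(dD, dIdx, dOut, numPairs);

        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&hOut[0], dOut, numPairs * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dD);
    cudaFree(dIdx);
    cudaFree(dOut);
    if (!ok) return -1;

    int mismatches = 0;
    for (unsigned int i = 0; i < numPairs; ++i) {
        const float expected = hD[2 * hIdx[i]] + hD[2 * hIdx[i] + 1];
        if (hOut[i] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    const int mismatches = runNoTextureCheck(1024);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

With a random index, each float2 load moves 8 useful bytes per memory transaction instead of 4, which is one plausible reason the non-texture version benefits.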

The strange behaviour is that this improvement disappears when I use a texture.

With a texture, having each thread fetch a float2 is slower than having each thread fetch a single float.

This is the source code I used for fetching and processing float2 data:

..
texture<float2,1> tex_D_float2;
..
cudaBindTexture(NULL, tex_D_float2, cu_vector_D, VectorDim*sizeof(float));
..
[inside the kernel]
__global__ void kernel2(.....)    // launched as kernel2<<<....>>>(.....)
{
     float2       tempFloat2_D;
     unsigned int index;
     ....
     [calculate "index" value]
     tempFloat2_D = tex1Dfetch(tex_D_float2, index);
     ...
     [process data]
     ....
     [output data]
}

[number of threads=VectorDim/2]
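To make the float2 view of the array explicit, here is a small self-contained sketch that binds a float2 texture over the same bytes and checks that texel i returns floats 2*i and 2*i+1. It uses the same legacy texture reference API as the code above, which was removed in CUDA 12, so it assumes an older toolkit. The names copyPairsFromTexture and runFloat2TextureCheck are made up for illustration, and the sequential index replaces the real (random) index calculation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Legacy texture reference, declared at file scope as in the post.
// Assumption: a toolkit older than CUDA 12, where this API still exists.
texture<float2, 1, cudaReadModeElementType> tex_D_float2;

__global__ void copyPairsFromTexture(float2* out, unsigned int numPairs)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= numPairs) return;
    // Texel "index" covers floats 2*index and 2*index+1 of the bound array.
    out[index] = tex1Dfetch(tex_D_float2, static_cast<int>(index));
}

// Returns the number of texels that do not match the host data,
// or -1 if a CUDA call failed.
int runFloat2TextureCheck(unsigned int vectorDim)
{
    const unsigned int numPairs = vectorDim / 2;
    std::vector<float> hD(vectorDim);
    std::vector<float2> hOut(numPairs);
    for (unsigned int i = 0; i < vectorDim; ++i) hD[i] = static_cast<float>(i + 1);

    float* dD = NULL;
    float2* dOut = NULL;
    bool ok = cudaMalloc(&dD, vectorDim * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dOut, numPairs * sizeof(float2)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dD, &hD[0], vectorDim * sizeof(float), cudaMemcpyHostToDevice);

        // The size is given in bytes, so it is the same as for the float
        // texture: VectorDim floats == VectorDim/2 float2 texels.
        ok = cudaBindTexture(NULL, tex_D_float2, dD,
                             vectorDim * sizeof(float)) == cudaSuccess;
    }
    if (ok) {
        const unsigned int threads = 128;
        const unsigned int blocks = (numPairs + threads - 1) / threads;
        copyPairsFromTexture<<<blocks, threads>>>(dOut, numPairs);

        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&hOut[0], dOut, numPairs * sizeof(float2),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaUnbindTexture(tex_D_float2);
    }
    cudaFree(dD);
    cudaFree(dOut);
    if (!ok) return -1;

    int mismatches = 0;
    for (unsigned int i = 0; i < numPairs; ++i) {
        if (hOut[i].x != hD[2 * i] || hOut[i].y != hD[2 * i + 1]) ++mismatches;
    }
    return mismatches;
}

int main()
{
    const int mismatches = runFloat2TextureCheck(1024);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

This only checks correctness of the indexing; it says nothing about why the float2 fetch is slower.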

I tried this algorithm on devices with compute capability 1.1 and 1.3 and noticed the same performance degradation when using a texture to fetch float2 data.
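For anyone who wants to reproduce the comparison, one common way to time a kernel is with CUDA events. The sketch below is only an illustration of that technique, not necessarily how the numbers above were measured. dummyKernel is a hypothetical stand-in that would be replaced by kernel1 or kernel2, and the warm-up launch and repeat count are arbitrary choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel under test.
__global__ void dummyKernel(float* data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the average time per launch in milliseconds over `repeats`
// launches, or -1.0f if a CUDA call failed.
float averageKernelMs(unsigned int n, int repeats)
{
    float* dData = NULL;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dData, 0, n * sizeof(float));

    const unsigned int threads = 128;
    const unsigned int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not included in the timing.
    dummyKernel<<<blocks, threads>>>(dData, n);

    cudaEventRecord(start, 0);
    for (int r = 0; r < repeats; ++r) {
        dummyKernel<<<blocks, threads>>>(dData, n);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all timed launches have finished

    float totalMs = 0.0f;
    const cudaError_t err = cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);

    if (err != cudaSuccess || cudaGetLastError() != cudaSuccess) return -1.0f;
    return totalMs / static_cast<float>(repeats);
}

int main()
{
    const float ms = averageKernelMs(1u << 20, 20);
    printf("average time per launch: %f ms\n", ms);
    return ms >= 0.0f ? 0 : 1;
}
```

Timing the float and float2 variants this way, with and without the texture, gives four numbers that can be compared directly.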

Did I make a mistake in the code? Or do CUDA devices really show this performance degradation when a texture is used to fetch float2 data per thread?

Please help me.

Thank you in advance,

Rocco