A question about a strange performance degradation when using textures


I'm writing a CUDA algorithm that has to process data fetched from an array (‘D’) of floats in device memory. For various reasons the data are fetched from memory in a random order (‘index’), so I bind a texture (‘tex_D_float’) to that array to improve performance. This is the source code:


texture<float,1> tex_D_float;


cudaBindTexture(NULL, tex_D_float, cu_vector_D , VectorDim*sizeof(float));


__global__ kernel1 <<<....>>> (.....)


     float        tempFloatD;

     unsigned int index;


     [calculate "index" value]



     [process data]


     [output data]


[number of threads=VectorDim]

With this code, each thread fetches only one float at a time (tempFloatD).
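A minimal, self-contained sketch of the one-float-per-thread texture fetch described above. The `scrambledIndex` helper, the kernel parameters, the `d_out` array and the problem size are hypothetical stand-ins for the elided steps in the listing. It uses the legacy texture reference API from the post, which was removed in CUDA 12, so it assumes an older toolkit.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Legacy texture reference bound to linear memory, as in the post.
texture<float, 1, cudaReadModeElementType> tex_D_float;

// Hypothetical stand-in for "[calculate index value]": scatters thread ids
// so that neighbouring threads read non-adjacent elements.
__device__ unsigned int scrambledIndex(unsigned int tid, unsigned int n)
{
    return (tid * 2654435761u) % n;
}

// One float per thread, fetched through the texture cache.
__global__ void kernel1(float *out, unsigned int vectorDim)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= vectorDim) return;

    unsigned int index = scrambledIndex(tid, vectorDim);
    float tempFloatD = tex1Dfetch(tex_D_float, (int)index);
    out[tid] = tempFloatD;  // placeholder for "[process data]" / "[output data]"
}

int main()
{
    const unsigned int VectorDim = 1u << 20;  // assumed size, for illustration
    std::vector<float> h_D(VectorDim);
    for (unsigned int i = 0; i < VectorDim; ++i) h_D[i] = (float)i;

    float *cu_vector_D = NULL, *d_out = NULL;
    cudaMalloc((void **)&cu_vector_D, VectorDim * sizeof(float));
    cudaMalloc((void **)&d_out, VectorDim * sizeof(float));
    cudaMemcpy(cu_vector_D, &h_D[0], VectorDim * sizeof(float),
               cudaMemcpyHostToDevice);

    // The size argument of cudaBindTexture is given in bytes.
    cudaBindTexture(NULL, tex_D_float, cu_vector_D, VectorDim * sizeof(float));

    const unsigned int threads = 256;
    kernel1<<<(VectorDim + threads - 1) / threads, threads>>>(d_out, VectorDim);
    cudaDeviceSynchronize();
    printf("kernel1: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaUnbindTexture(tex_D_float);
    cudaFree(cu_vector_D);
    cudaFree(d_out);
    return 0;
}
```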

In a second stage, I noticed that I could improve the processing algorithm by fetching and processing two floats that are consecutive in memory (float2 type) at a time.

Without a texture, fetching float2 data per thread gives a very evident improvement over fetching and processing a single float per thread.
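A minimal sketch of the non-texture float2 variant, assuming the same array is simply viewed as `float2` pairs. The kernel name `kernel2_noTexture`, its parameters and the `scrambledIndex` helper are hypothetical; the real index calculation and processing steps are not shown in this post.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for "[calculate index value]".
__device__ unsigned int scrambledIndex(unsigned int tid, unsigned int n)
{
    return (tid * 2654435761u) % n;
}

// Each thread reads one float2 (two consecutive floats) directly from
// global memory. pairCount is assumed to be VectorDim / 2.
__global__ void kernel2_noTexture(const float2 *D2, float2 *out,
                                  unsigned int pairCount)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= pairCount) return;

    unsigned int index = scrambledIndex(tid, pairCount);  // index in float2 units
    float2 tempFloat2_D = D2[index];  // typically a single 8-byte load
    out[tid] = tempFloat2_D;          // placeholder for process/output steps
}
```

On the host side, the float array would be passed as `reinterpret_cast<const float2 *>(cu_vector_D)`; pointers returned by `cudaMalloc` are aligned strictly enough for this view.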

The strange behaviour is that the performance does not follow this improvement when I use a texture.

I noticed that, when a texture is used to fetch the data, having each thread fetch a float2 is slower than having each thread fetch a single float.

This is the source code I used for fetching and processing float2 data:


texture<float2,1> tex_D_float2;


cudaBindTexture(NULL, tex_D_float2, cu_vector_D , VectorDim*sizeof(float));


[inside the kernel]

__global__ kernel2 <<<....>>> (.....)


     float2        tempFloat2_D;

     unsigned int index;


     [calculate "index" value]



     [process data]


     [output data]


[number of threads=VectorDim/2]
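A minimal sketch of the float2 texture fetch, again using the legacy texture reference API (removed in CUDA 12). The `scrambledIndex` helper, the kernel parameters and the `bindAsFloat2` wrapper are hypothetical names introduced only for illustration. Two details are worth checking against the real code: the index passed to `tex1Dfetch` counts float2 elements rather than floats, and the bind size stays in bytes.

```cuda
#include <cuda_runtime.h>

// Legacy texture reference with float2 texels, as in the post.
texture<float2, 1, cudaReadModeElementType> tex_D_float2;

// Hypothetical stand-in for "[calculate index value]".
__device__ unsigned int scrambledIndex(unsigned int tid, unsigned int n)
{
    return (tid * 2654435761u) % n;
}

// Each thread fetches one float2 texel. pairCount is assumed to be
// VectorDim / 2, matching the thread count given in the post.
__global__ void kernel2(float2 *out, unsigned int pairCount)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= pairCount) return;

    unsigned int index = scrambledIndex(tid, pairCount);  // float2 units
    float2 tempFloat2_D = tex1Dfetch(tex_D_float2, (int)index);
    out[tid] = tempFloat2_D;  // placeholder for process/output steps
}

// Hypothetical host-side wrapper around the bind call from the post.
// For even VectorDim the byte size equals (VectorDim / 2) * sizeof(float2),
// so it is unchanged from the float case.
cudaError_t bindAsFloat2(const float *cu_vector_D, size_t VectorDim)
{
    return cudaBindTexture(NULL, tex_D_float2, cu_vector_D,
                           VectorDim * sizeof(float));
}
```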

I tried this algorithm on devices with compute capability 1.1 and 1.3 and noticed the same performance degradation when using a texture to fetch float2 data.

Did I make a mistake in the code? Or do CUDA devices really show this performance degradation when a texture is used to fetch float2 data per thread?

Please help me.

Thank you in advance.
