int vs int4 in cudaBindTexture

Hello,

I wrote a kernel that was reading integers from a texture and initially my texture was bound to int elements.
So, in order to fetch an element I was doing:

nextID = tex1Dfetch(texture1,currentID*257+offset);

I then changed the texture to be bound to int4 elements, so my code became:

int offset1 = currentID*257+offset;
int4 nextid4 = tex1Dfetch(texture1,offset1/4);
int nextitmp[4];
nextitmp[0] = nextid4.x; nextitmp[1] = nextid4.y; nextitmp[2] = nextid4.z; nextitmp[3] = nextid4.w;
nextID = nextitmp[offset1%4];

The performance of my kernel dropped by a factor of 5-6. Are int4 elements more expensive to fetch, or am I missing something?
Or is the extra code I added to pick out the exact integer responsible for the low performance? (I was expecting tex1Dfetch to
be the performance bottleneck.)

Note: I changed int to int4 so as to support a larger texture size.
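
To make the indexing above concrete: each int4 element holds four ints, so the same number of texture elements covers four times as many integers, and a flat integer index has to be split into an element index and a component number. Here is a minimal sketch of that split, assuming the flat index is never negative. TexelAddr and split_index are hypothetical names used only for illustration; they are not part of the kernel above.

```cpp
// Hypothetical helper types/names, for illustration only.
struct TexelAddr {
    int texel;  // index of the int4 element to fetch
    int lane;   // 0..3: which of x, y, z, w holds the wanted int
};

// Split a flat, non-negative int index into (int4 element, component).
// When compiling with nvcc for device use, this would additionally be
// marked __host__ __device__.
inline TexelAddr split_index(int flat) {
    TexelAddr a;
    a.texel = flat / 4;  // four ints per int4 element
    a.lane  = flat % 4;  // position inside that element
    return a;
}
```

For example, with currentID = 3 and offset = 5 the flat index is 3*257+5 = 776, which lands in element 194, component 0 (the .x field).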

The problem is coming from the following lines:

int offset1 = currentID*257+offset;
int nextitmp[4];
nextitmp[0] = nextid4.x; nextitmp[1] = nextid4.y; nextitmp[2] = nextid4.z; nextitmp[3] = nextid4.w;
nextID = nextitmp[offset1%4];

Since the value of offset1 isn’t known at compile time, indexing into nextitmp with offset1%4 forces the compiler to place the array in local memory, which is much slower than registers. Try the following version instead:

int offset1 = currentID*257+offset;
int4 nextid4 = tex1Dfetch(texture1,offset1>>2);
if ((offset1&3)==0) nextID = nextid4.x;
else if ((offset1&3)==1) nextID = nextid4.y;
else if ((offset1&3)==2) nextID = nextid4.z;
else nextID = nextid4.w;
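
The same idea can be wrapped in a small helper so the component selection lives in one place. This is only a sketch: select_component and the HOST_DEVICE macro are hypothetical names, the host-side int4 struct is a stand-in so the code also builds without nvcc, and whether everything really stays in registers depends on the compiler version, so it is worth checking the generated code for local memory usage.

```cpp
#ifndef __CUDACC__
// Host-only stand-in so the sketch also compiles without nvcc;
// under nvcc the built-in int4 vector type is used instead.
struct int4 { int x, y, z, w; };
#define HOST_DEVICE
#else
#define HOST_DEVICE __host__ __device__
#endif

// select_component is a hypothetical helper name, not from the thread.
// It picks one of the four ints using comparisons only, so no
// dynamically indexed local array is needed.
HOST_DEVICE inline int select_component(int4 v, int flat_index) {
    int lane = flat_index & 3;  // same as flat_index % 4 for non-negative values
    if (lane == 0) return v.x;
    if (lane == 1) return v.y;
    if (lane == 2) return v.z;
    return v.w;
}
```

In the kernel from the question this would be used roughly as nextID = select_component(tex1Dfetch(texture1, offset1 >> 2), offset1);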

It seems that the problem is that two textures with the same element type exist at the same time.
Do textures with the same element type share the same channels and/or resources?
Having two textures also gave me inconsistent results: the same kernel
was producing different numbers on different hardware. What am I missing? :s

A side note: to get the value modulo 4, the mask must be 3, not 2, and the comparison needs parentheses because == binds tighter than &, i.e. (offset1&3)==0.