Texture question


I have a somewhat weird question. I have a float * array to which I bind a float4 texture.

The reason is best explained in the following manner:

 I need to read 4 floats for thread X, values: A, B, C, D

 Thread X+1 will read 4 floats, values: C, D, E, F   (i.e. C and D in both threads are the same)

Is there a way to do this? From playing with it a bit, I see that if I do:

texture< float4, 1, cudaReadModeElementType > mytex;


tex1Dfetch( mytex, pos );

then pos brings me the four values starting at index pos * 4 in the array (which is reasonable, I guess), but that means I can't start a read at positions in the array that are divisible by 2 but not by 4.

Am I missing something ?



Not with float4, but two (cached!) float2 reads will do it nicely.
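Here is a minimal sketch of the two-read idea, assuming the data is exposed as a float2 texture over linear memory. It uses the texture object API (cudaCreateTextureObject) instead of the texture<> reference from the question, because texture references are deprecated in newer CUDA toolkits. The names gatherWindows and makeFloat2Texture are made up for illustration, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Thread x assembles floats 2x .. 2x+3 from two neighbouring float2 texels.
// Thread x+1 starts at float 2x+2, so consecutive windows overlap by two floats.
__global__ void gatherWindows(cudaTextureObject_t tex, float4 *out, int numWindows)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= numWindows) return;

    float2 lo = tex1Dfetch<float2>(tex, x);      // floats 2x,   2x+1
    float2 hi = tex1Dfetch<float2>(tex, x + 1);  // floats 2x+2, 2x+3
    out[x] = make_float4(lo.x, lo.y, hi.x, hi.y);
}

// Hypothetical helper: view a device array of n floats (n even) as float2 texels.
cudaTextureObject_t makeFloat2Texture(const float *d_data, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float *>(d_data);
    res.res.linear.desc = cudaCreateChannelDesc<float2>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

For n input floats there are n/2 - 1 windows, so the kernel would be launched with numWindows = n/2 - 1.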

The problem is that with float2 I didn't get any performance change, while with float4 I got a ~30% performance boost (the results were faulty, though :) )

Any suggestions???

Actually, when I think of it, if tex1Dfetch with float4 didn't multiply the index by 4, I could have used any indexing I'd like, I think…


I am not so texture oriented, but this is what the programming guide says:
template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef,
float x);

The tex1D function can accept a floating-point “x”. So use “0.5*index” with float4… Will that not work?

tex1D() requires the texture to be bound to a cudaArray… but if that helps, you could use it.

Nope… :(

Any additional ideas guys??



I am just curious. Why will that not work?

Because, if we somehow set up: n4 = data4[n], then n4.x will always equal one of the ‘x’ values in the data set (or possibly be an interpolation between some values of ‘x’.)

What was asked for was instead a way for every second thread to do unaligned access so that they would be assigned n4.x = data4[n].z … n4.w = data4[n+1].y
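To make that concrete, here is a small host-side model (not device code) of what a float4 texture returns for a fractional coordinate, assuming unnormalized coordinates and clamped addressing. It follows the point-sampling and linear-filtering formulas as I understand them from the programming guide, in idealized full precision. The function names are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Clamp a texel index into [0, n-1] (models clamp addressing).
static int clampIndex(int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); }

// Point sampling: coordinate x selects the whole texel floor(x).
float4 pointSample(const std::vector<float4> &t, float x)
{
    int n = static_cast<int>(t.size());
    return t[clampIndex(static_cast<int>(std::floor(x)), n)];
}

// Linear filtering: each component is blended only with the same component
// of the neighbouring texel, so .x never picks up a .z value.
float4 linearSample(const std::vector<float4> &t, float x)
{
    int n = static_cast<int>(t.size());
    float xb = x - 0.5f;
    float fl = std::floor(xb);
    float a = xb - fl;                        // blend weight in [0, 1)
    float4 p = t[clampIndex(static_cast<int>(fl), n)];
    float4 q = t[clampIndex(static_cast<int>(fl) + 1, n)];
    return make_float4((1 - a) * p.x + a * q.x, (1 - a) * p.y + a * q.y,
                       (1 - a) * p.z + a * q.z, (1 - a) * p.w + a * q.w);
}
```

With texels (1, 2, 4, 8) and (16, 32, 64, 128), pointSample at 0.5 still returns the first texel, and linearSample at 1.0 returns .x = 8.5, a mix of 1 and 16. Neither gives the wanted window (4, 8, 16, 32).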

Additional ideas? Mmm …

Use twice the memory - duplicating values - so that your access pattern matches the hardware?
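A rough sketch of that duplication, done on the host before the copy to the GPU. The helper name expandWindows is made up; it assumes the input has an even number of floats and at least four of them.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: expand n floats into n/2 - 1 overlapping float4 windows,
// where window x holds floats 2x .. 2x+3. The result can be bound as a float4
// texture so that thread x fetches texel x directly.
std::vector<float4> expandWindows(const std::vector<float> &in)
{
    std::vector<float4> out;
    for (size_t s = 0; s + 3 < in.size(); s += 2)
        out.push_back(make_float4(in[s], in[s + 1], in[s + 2], in[s + 3]));
    return out;
}
```

The cost is roughly double the memory and upload size, in exchange for one aligned float4 fetch per thread.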

Have the currently even threads side by side in one warp executing smoothly and the odd threads with borked access pattern in another warp?


I understand that. But we are talking about textures here.

The example you gave is for a float4 array. I am not sure how they are related.

I was talking about the “tex1D” function that takes a floating-point index.

So the effect it would have is: float4_array[0], then float4_array[0.5], which would fetch starting from the 2nd float2… No?

I don't understand much about interpolation or textures. Can you explain why this would not work? Thanks!


BrookGPU (the basis of Brook+, used by AMD Stream) has a set of “stream” operators that group the elements of a stream.

For example, if you have a stream of elements like <1,2,3,4,5,6,7,8>, you can group them as { <1,2>, <2,3>, <3,4> … }.
There are many different ways of grouping, specifying boundary conditions and so on…

Check out the stream operator section in the Brook spec.

You may want to group your input data before you memcpy it to the GPU…
You can use a GPU kernel to do the grouping instead of the CPU (if your data size is large).

Good Luck!

I’ve used a float2 texture instead of a “regular/single” float one. The mapping was trivial (CUDA rocks!!)
Just doing so gave ~15-20% performance boost for my overall code.
It would still be nice to know if I could use float4 instead of float2 as I described in the first post. nVidia??? :)


Your “even threads” - if that really is where they must be - could still do their access the more straightforward float4 way if you had two texRef (one float2, one float4) to the same underlying data …

Yes, this is something I thought of. Moreover, tomorrow I’ll try the following:

bind one float4 texture to position zero of the input array, and bind another float4 texture to position[2] of the input, then access the first texture in “even” threads and the second texture in “odd” threads. I don't know whether the even/odd computation will hurt performance, or whether it will work at all.

I will post the results though :)

thanks a lot
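For reference, the even/odd selection described above could look roughly like this. It is only a sketch with made-up names, written against the texture object API. It assumes texA views the floats starting at element 0 and texB views the same floats starting at element 2. One caveat as far as I know: texture base addresses must meet the device's texture alignment (cudaDeviceProp::textureAlignment), and an 8-byte offset usually will not, so texB may have to be backed by a separately allocated copy shifted by two floats.

```cuda
#include <cuda_runtime.h>

// Window x starts at float 2x.
// Even x = 2k   -> floats 4k   .. 4k+3 -> texel k of texA.
// Odd  x = 2k+1 -> floats 4k+2 .. 4k+5 -> texel k of texB (which starts at float 2).
__global__ void gatherTwoTextures(cudaTextureObject_t texA, cudaTextureObject_t texB,
                                  float4 *out, int numWindows)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= numWindows) return;

    int texel = x >> 1;  // k in both cases above
    out[x] = (x & 1) ? tex1Dfetch<float4>(texB, texel)
                     : tex1Dfetch<float4>(texA, texel);
}
```

Neighbouring threads take different sides of the select, but both sides are a single fetch, so the even/odd test itself should be cheap. Whether it beats two float2 fetches would need measuring.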